Informatica Big Data Management

(Version 10.1)

Installation and Configuration Guide

Informatica Big Data Management Installation and Configuration Guide

Version 10.1
June 2016
© Copyright Informatica LLC 2014, 2016

This software and documentation contain proprietary information of Informatica LLC and are provided under a license agreement containing restrictions on use and
disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any
form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC. This Software may be protected by U.S. and/or
international Patents and other Patents Pending.

Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as
provided in DFARS 227.7202-1(a) and 227.7202-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14
(ALT III), as applicable.

The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us
in writing.
Informatica, Informatica Platform, Informatica Data Services, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange,
PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer, Informatica B2B Data Transformation, Informatica B2B Data Exchange, Informatica
On Demand, Informatica Identity Resolution, Informatica Application Information Lifecycle Management, Informatica Complex Event Processing, Ultra Messaging,
Informatica Master Data Management, and Live Data Map are trademarks or registered trademarks of Informatica LLC in the United States and in jurisdictions
throughout the world. All other company and product names may be trade names or trademarks of their respective owners.

Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright DataDirect Technologies. All rights
reserved. Copyright © Sun Microsystems. All rights reserved. Copyright © RSA Security Inc. All Rights Reserved. Copyright © Ordinal Technology Corp. All rights
reserved. Copyright © Aandacht c.v. All rights reserved. Copyright Genivia, Inc. All rights reserved. Copyright Isomorphic Software. All rights reserved. Copyright © Meta
Integration Technology, Inc. All rights reserved. Copyright © Intalio. All rights reserved. Copyright © Oracle. All rights reserved. Copyright © Adobe Systems
Incorporated. All rights reserved. Copyright © DataArt, Inc. All rights reserved. Copyright © ComponentSource. All rights reserved. Copyright © Microsoft Corporation. All
rights reserved. Copyright © Rogue Wave Software, Inc. All rights reserved. Copyright © Teradata Corporation. All rights reserved. Copyright © Yahoo! Inc. All rights
reserved. Copyright © Glyph & Cog, LLC. All rights reserved. Copyright © Thinkmap, Inc. All rights reserved. Copyright © Clearpace Software Limited. All rights
reserved. Copyright © Information Builders, Inc. All rights reserved. Copyright © OSS Nokalva, Inc. All rights reserved. Copyright Edifecs, Inc. All rights reserved.
Copyright Cleo Communications, Inc. All rights reserved. Copyright © International Organization for Standardization 1986. All rights reserved. Copyright © ej-
technologies GmbH. All rights reserved. Copyright © Jaspersoft Corporation. All rights reserved. Copyright © International Business Machines Corporation. All rights
reserved. Copyright © yWorks GmbH. All rights reserved. Copyright © Lucent Technologies. All rights reserved. Copyright (c) University of Toronto. All rights reserved.
Copyright © Daniel Veillard. All rights reserved. Copyright © Unicode, Inc. Copyright IBM Corp. All rights reserved. Copyright © MicroQuill Software Publishing, Inc. All
rights reserved. Copyright © PassMark Software Pty Ltd. All rights reserved. Copyright © LogiXML, Inc. All rights reserved. Copyright © 2003-2010 Lorenzi Davide, All
rights reserved. Copyright © Red Hat, Inc. All rights reserved. Copyright © The Board of Trustees of the Leland Stanford Junior University. All rights reserved. Copyright
© EMC Corporation. All rights reserved. Copyright © Flexera Software. All rights reserved. Copyright © Jinfonet Software. All rights reserved. Copyright © Apple Inc. All
rights reserved. Copyright © Telerik Inc. All rights reserved. Copyright © BEA Systems. All rights reserved. Copyright © PDFlib GmbH. All rights reserved. Copyright ©
Orientation in Objects GmbH. All rights reserved. Copyright © Tanuki Software, Ltd. All rights reserved. Copyright © Ricebridge. All rights reserved. Copyright © Sencha,
Inc. All rights reserved. Copyright © Scalable Systems, Inc. All rights reserved. Copyright © jQWidgets. All rights reserved. Copyright © Tableau Software, Inc. All rights
reserved. Copyright© MaxMind, Inc. All Rights Reserved. Copyright © TMate Software s.r.o. All rights reserved. Copyright © MapR Technologies Inc. All rights reserved.
Copyright © Amazon Corporate LLC. All rights reserved. Copyright © Highsoft. All rights reserved. Copyright © Python Software Foundation. All rights reserved.
Copyright © BeOpen.com. All rights reserved. Copyright © CNRI. All rights reserved.

This product includes software developed by the Apache Software Foundation (http://www.apache.org/), and/or other software which is licensed under various versions
of the Apache License (the "License"). You may obtain a copy of these Licenses at http://www.apache.org/licenses/. Unless required by applicable law or agreed to in
writing, software distributed under these Licenses is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the Licenses for the specific language governing permissions and limitations under the Licenses.

This product includes software which was developed by Mozilla (http://www.mozilla.org/), software copyright The JBoss Group, LLC, all rights reserved; software
copyright © 1999-2006 by Bruno Lowagie and Paulo Soares and other software which is licensed under various versions of the GNU Lesser General Public License
Agreement, which may be found at http://www.gnu.org/licenses/lgpl.html. The materials are provided free of charge by Informatica, "as-is", without warranty of any
kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.

The product includes ACE(TM) and TAO(TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California,
Irvine, and Vanderbilt University, Copyright (©) 1993-2006, all rights reserved.

This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (copyright The OpenSSL Project. All Rights Reserved) and
redistribution of this software is subject to terms available at http://www.openssl.org and http://www.openssl.org/source/license.html.

This product includes Curl software which is Copyright 1996-2013, Daniel Stenberg, <daniel@haxx.se>. All Rights Reserved. Permissions and limitations regarding this
software are subject to terms available at http://curl.haxx.se/docs/copyright.html. Permission to use, copy, modify, and distribute this software for any purpose with or
without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

The product includes software copyright 2001-2005 (©) MetaStuff, Ltd. All Rights Reserved. Permissions and limitations regarding this software are subject to terms
available at http://www.dom4j.org/license.html.

The product includes software copyright © 2004-2007, The Dojo Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to
terms available at http://dojotoolkit.org/license.

This product includes ICU software which is copyright International Business Machines Corporation and others. All rights reserved. Permissions and limitations
regarding this software are subject to terms available at http://source.icu-project.org/repos/icu/icu/trunk/license.html.

This product includes software copyright © 1996-2006 Per Bothner. All rights reserved. Your right to use such materials is set forth in the license which may be found at
http://www.gnu.org/software/kawa/Software-License.html.

This product includes OSSP UUID software which is Copyright © 2002 Ralf S. Engelschall, Copyright © 2002 The OSSP Project Copyright © 2002 Cable & Wireless
Deutschland. Permissions and limitations regarding this software are subject to terms available at http://www.opensource.org/licenses/mit-license.php.

This product includes software developed by Boost (http://www.boost.org/) or under the Boost software license. Permissions and limitations regarding this software are
subject to terms available at http://www.boost.org/LICENSE_1_0.txt.

This product includes software copyright © 1997-2007 University of Cambridge. Permissions and limitations regarding this software are subject to terms available at
http://www.pcre.org/license.txt.

This product includes software copyright © 2007 The Eclipse Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms
available at http://www.eclipse.org/org/documents/epl-v10.php and at http://www.eclipse.org/org/documents/edl-v10.php.
This product includes software licensed under the terms at http://www.tcl.tk/software/tcltk/license.html, http://www.bosrup.com/web/overlib/?License, http://
www.stlport.org/doc/ license.html, http://asm.ow2.org/license.html, http://www.cryptix.org/LICENSE.TXT, http://hsqldb.org/web/hsqlLicense.html, http://
httpunit.sourceforge.net/doc/ license.html, http://jung.sourceforge.net/license.txt , http://www.gzip.org/zlib/zlib_license.html, http://www.openldap.org/software/release/
license.html, http://www.libssh2.org, http://slf4j.org/license.html, http://www.sente.ch/software/OpenSourceLicense.html, http://fusesource.com/downloads/license-
agreements/fuse-message-broker-v-5-3- license-agreement; http://antlr.org/license.html; http://aopalliance.sourceforge.net/; http://www.bouncycastle.org/licence.html;
http://www.jgraph.com/jgraphdownload.html; http://www.jcraft.com/jsch/LICENSE.txt; http://jotm.objectweb.org/bsd_license.html; . http://www.w3.org/Consortium/Legal/
2002/copyright-software-20021231; http://www.slf4j.org/license.html; http://nanoxml.sourceforge.net/orig/copyright.html; http://www.json.org/license.html; http://
forge.ow2.org/projects/javaservice/, http://www.postgresql.org/about/licence.html, http://www.sqlite.org/copyright.html, http://www.tcl.tk/software/tcltk/license.html, http://
www.jaxen.org/faq.html, http://www.jdom.org/docs/faq.html, http://www.slf4j.org/license.html; http://www.iodbc.org/dataspace/iodbc/wiki/iODBC/License; http://
www.keplerproject.org/md5/license.html; http://www.toedter.com/en/jcalendar/license.html; http://www.edankert.com/bounce/index.html; http://www.net-snmp.org/about/
license.html; http://www.openmdx.org/#FAQ; http://www.php.net/license/3_01.txt; http://srp.stanford.edu/license.txt; http://www.schneier.com/blowfish.html; http://
www.jmock.org/license.html; http://xsom.java.net; http://benalman.com/about/license/; https://github.com/CreateJS/EaselJS/blob/master/src/easeljs/display/Bitmap.js;
http://www.h2database.com/html/license.html#summary; http://jsoncpp.sourceforge.net/LICENSE; http://jdbc.postgresql.org/license.html; http://
protobuf.googlecode.com/svn/trunk/src/google/protobuf/descriptor.proto; https://github.com/rantav/hector/blob/master/LICENSE; http://web.mit.edu/Kerberos/krb5-
current/doc/mitK5license.html; http://jibx.sourceforge.net/jibx-license.html; https://github.com/lyokato/libgeohash/blob/master/LICENSE; https://github.com/hjiang/jsonxx/
blob/master/LICENSE; https://code.google.com/p/lz4/; https://github.com/jedisct1/libsodium/blob/master/LICENSE; http://one-jar.sourceforge.net/index.php?
page=documents&file=license; https://github.com/EsotericSoftware/kryo/blob/master/license.txt; http://www.scala-lang.org/license.html; https://github.com/tinkerpop/
blueprints/blob/master/LICENSE.txt; http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/intro.html; https://aws.amazon.com/asl/; https://github.com/
twbs/bootstrap/blob/master/LICENSE; https://sourceforge.net/p/xmlunit/code/HEAD/tree/trunk/LICENSE.txt; https://github.com/documentcloud/underscore-contrib/blob/
master/LICENSE, and https://github.com/apache/hbase/blob/master/LICENSE.txt.
This product includes software licensed under the Academic Free License (http://www.opensource.org/licenses/afl-3.0.php), the Common Development and Distribution
License (http://www.opensource.org/licenses/cddl1.php) the Common Public License (http://www.opensource.org/licenses/cpl1.0.php), the Sun Binary Code License
Agreement Supplemental License Terms, the BSD License (http://www.opensource.org/licenses/bsd-license.php), the new BSD License (http://opensource.org/
licenses/BSD-3-Clause), the MIT License (http://www.opensource.org/licenses/mit-license.php), the Artistic License (http://www.opensource.org/licenses/artistic-
license-1.0) and the Initial Developer’s Public License Version 1.0 (http://www.firebirdsql.org/en/initial-developer-s-public-license-version-1-0/).

This product includes software copyright © 2003-2006 Joe Walnes, 2006-2007 XStream Committers. All rights reserved. Permissions and limitations regarding this
software are subject to terms available at http://xstream.codehaus.org/license.html. This product includes software developed by the Indiana University Extreme! Lab.
For further information please visit http://www.extreme.indiana.edu/.

This product includes software Copyright (c) 2013 Frank Balluffi and Markus Moeller. All rights reserved. Permissions and limitations regarding this software are subject
to terms of the MIT license.

See patents at https://www.informatica.com/legal/patents.html.

DISCLAIMER: Informatica LLC provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied
warranties of noninfringement, merchantability, or use for a particular purpose. Informatica LLC does not warrant that this software or documentation is error free. The
information provided in this software or documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is
subject to change at any time without notice.

NOTICES

This Informatica product (the "Software") includes certain drivers (the "DataDirect Drivers") from DataDirect Technologies, an operating company of Progress Software
Corporation ("DataDirect") which are subject to the following terms and conditions:

1. THE DATADIRECT DRIVERS ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT
INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT
LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.

Part Number: IN-BDE-101-0001


Table of Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Informatica Resources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Informatica Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Informatica Knowledge Base. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Informatica Documentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Informatica Product Availability Matrixes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Informatica Velocity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Informatica Marketplace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Informatica Global Customer Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

Chapter 1: Big Data Management Installation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10


Installation Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Informatica Big Data Management Installation Process. . . . . . . . . . . . . . . . . . . . . . . . . . 10
Before You Begin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Install and Configure the Informatica Domain and Clients. . . . . . . . . . . . . . . . . . . . . . . . . 11
Install and Configure PowerExchange Adapters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Install and Configure Data Replication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Pre-Installation Tasks for a Single Node Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Pre-Installation Tasks for a Cluster Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Big Data Management Installation from an RPM Package. . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Download the Distribution Package. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Installing in a Single Node Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Installing in a Cluster Environment from the Primary Name Node Using SCP Protocol. . . . . . 15
Installing Big Data Management Using Another Protocol. . . . . . . . . . . . . . . . . . . . . . . . . 16
Installing in a Cluster Environment from a Non-Name Node Machine. . . . . . . . . . . . . . . . . . 16
Big Data Management Installation from a Debian Package. . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Download the Debian Package. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Installing Big Data Management in a Single Node Environment. . . . . . . . . . . . . . . . . . . . . 18
Installing Big Data Management Using the SCP Protocol. . . . . . . . . . . . . . . . . . . . . . . . . 18
Installing Big Data Management Using Another Protocol. . . . . . . . . . . . . . . . . . . . . . . . . 18
Installing Big Data Management in a Cluster Environment. . . . . . . . . . . . . . . . . . . . . . . . 19
Big Data Management Installation from a Cloudera Parcel Package . . . . . . . . . . . . . . . . . . . . . 19
Installing Big Data Management Using Cloudera Manager. . . . . . . . . . . . . . . . . . . . . . . . 19
Informatica Big Data Management Uninstallation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Uninstalling Big Data Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Uninstalling Big Data Management on Cloudera. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

Chapter 2: Post-Installation Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22


Post-Installation Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Reference Data Requirements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

Installing the Address Reference Data Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Update Hadoop Cluster Configuration Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Enable Developer Tool Communication with the Hadoop Cluster . . . . . . . . . . . . . . . . . . . . . . . 25
Enable Support for Lookup Transformations with Teradata Data Objects. . . . . . . . . . . . . . . . . . 25
Big Data Management Configuration Utility. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Use Cloudera Manager. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Use Apache Ambari. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Use SSH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Download the JDBC Driver JAR Files for Sqoop Connectivity. . . . . . . . . . . . . . . . . . . . . . . . . 33
Add Hadoop Environment Variable Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Configure Run-time Engines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Blaze Engine Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Spark Engine Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Hive Engine Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

Chapter 3: Configuring Big Data Management to Run Mappings in Hadoop Environments. . . . . . . . . . . . . . . . . . . . 41
Mappings on Hadoop Distributions Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Enable Mappings in a Hadoop Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Configure Hive Variables for Mappings in a Hadoop Environment. . . . . . . . . . . . . . . . . . . . 42
Configure Environment Variables in the Big Data Management Hadoop Environment
Properties File. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Configuring Library Path and Path Variables for Mappings in a Hadoop Environment. . . . . . . 43
Configuring Big Data Management in the Amazon EMR Environment. . . . . . . . . . . . . . . . . . . . 43
Install the EBF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Domain Configuration Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Cluster Configuration Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Configuring Big Data Management in the Azure HDInsight Cloud Environment. . . . . . . . . . . . . . 47
Prerequisites to Configure Big Data Management for Azure HDInsight. . . . . . . . . . . . . . . . . 48
Perform Post-Installation Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Populate the HDFS File System. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Configure the Hadoop Pushdown Properties for the Data Integration Service. . . . . . . . . . . . 49
Edit Informatica Developer Files and Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Configure Environment Variables in the Big Data Management Hadoop Environment
Properties File. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Configure Hadoop Cluster Properties on the Data Integration Service. . . . . . . . . . . . . . . . . 50
Configure Big Data Management on the HDInsight Cluster. . . . . . . . . . . . . . . . . . . . . . . . 53
Configure Mappings to Run on the Hadoop Cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Configure and Start Informatica Services. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Configure Settings on the Informatica Domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Connections. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Configuring Big Data Management in the Cloudera Environment. . . . . . . . . . . . . . . . . . . . . . . 73
Configure Hadoop Cluster Properties on the Data Integration Service Machine. . . . . . . . . . . 74

Create a Staging Directory on HDFS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Configure Virtual Memory Limits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Add hbase_protocol.jar to the Hadoop classpath. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Configure the HiveServer2 Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Configure the Hadoop Cluster for the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Configure HiveServer2 for DB2 Partitioning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Disable SQL Standard Based Authorization for HiveServer2. . . . . . . . . . . . . . . . . . . . . . . 80
Configuring Big Data Management in the Hortonworks HDP Environment. . . . . . . . . . . . . . . . . 80
Configure Hadoop Cluster Properties for the Data Integration Service. . . . . . . . . . . . . . . . . 81
Configure the Mapping Logic Pushdown Method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Add hbase_protocol.jar to the Hadoop classpath. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
HiveServer 2 Configuration Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Configure the Hadoop Cluster for the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Update Cluster Configuration Settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Configuring Big Data Management in the IBM BigInsights Environment. . . . . . . . . . . . . . . . . . . 93
User Account for the JDBC and Hive Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Enable Support for Data Quality Capabilities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Create the HiveServer2 Environment Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Enable Support for HBase with HiveServer2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Configure the Hadoop Cluster for the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Configuring Big Data Management in the MapR Environment. . . . . . . . . . . . . . . . . . . . . . . . . 97
Verify the Cluster Details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Install the EBF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Configure the Informatica Domain to Communicate with a Kerberos-Enabled MapR 5.1
Cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Configure Run-time Engines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Configure the Informatica Domain to Communicate with a Cluster that Uses MapR Ticket
Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Configure Hive and HDFS Metadata Fetch for MapR Ticket or Kerberos. . . . . . . . . . . . . . . 106
Running Mappings Using the Teradata Connector for Hadoop on a Hive or Blaze Engine. . . 107
Configure Environment Variables for MapR 5.1 in the Hadoop Environment Properties File. . 107
Configure Hadoop Cluster Properties on the Data Integration Service Machine for
MapReduce 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Configure yarn-site.xml for MapReduce 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Edit warden.conf to Configure Heap Space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Configure the Developer Tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

Chapter 4: High Availability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112


Configure High Availability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Configuring Big Data Management for a Highly Available Cloudera CDH Cluster. . . . . . . . . . . . 113
Enable Support for a Highly Available Hortonworks HDP Cluster. . . . . . . . . . . . . . . . . . . . . . 114
Configure Cluster Properties for a Highly Available Name Node. . . . . . . . . . . . . . . . . . . . 114
Configure Cluster Properties for a Highly Available Resource Manager. . . . . . . . . . . . . . . 116

Configuring Big Data Management for a Highly Available Hortonworks HDP Cluster. . . . . . . 118
Configuring Big Data Management for a Highly Available IBM BigInsights Cluster. . . . . . . . . . . 119
Configuring Informatica for Highly Available MapR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

Appendix A: Upgrade Big Data Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121


Upgrading Big Data Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

Appendix B: Configure Ports for Big Data Management. . . . . . . . . . . . . . . . . . . . . . . 122


Informatica Domain and Application Services. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Application Services and Ports. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Big Data Management Ports. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Ports for the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Informatica Developer Ports. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

Preface
The Informatica Big Data Management Installation and Configuration Guide is written for the system
administrator who is responsible for installing Informatica Big Data Management. This guide assumes you
have knowledge of operating systems, relational database concepts, and the database engines, flat files, or
mainframe systems in your environment. This guide also assumes you are familiar with the interface
requirements for the Hadoop environment.

Informatica Resources

Informatica Network
Informatica Network hosts Informatica Global Customer Support, the Informatica Knowledge Base, and other
product resources. To access Informatica Network, visit https://network.informatica.com.

As a member, you can:

• Access all of your Informatica resources in one place.


• Search the Knowledge Base for product resources, including documentation, FAQs, and best practices.
• View product availability information.
• Review your support cases.
• Find your local Informatica User Group Network and collaborate with your peers.

Informatica Knowledge Base


Use the Informatica Knowledge Base to search Informatica Network for product resources such as
documentation, how-to articles, best practices, and PAMs.

To access the Knowledge Base, visit https://kb.informatica.com. If you have questions, comments, or ideas
about the Knowledge Base, contact the Informatica Knowledge Base team at
KB_Feedback@informatica.com.

Informatica Documentation
To get the latest documentation for your product, browse the Informatica Knowledge Base at
https://kb.informatica.com/_layouts/ProductDocumentation/Page/ProductDocumentSearch.aspx.

If you have questions, comments, or ideas about this documentation, contact the Informatica Documentation
team through email at infa_documentation@informatica.com.

Informatica Product Availability Matrixes
Product Availability Matrixes (PAMs) indicate the versions of operating systems, databases, and other types
of data sources and targets that a product release supports. If you are an Informatica Network member, you
can access PAMs at
https://network.informatica.com/community/informatica-network/product-availability-matrices.

Informatica Velocity
Informatica Velocity is a collection of tips and best practices developed by Informatica Professional Services.
Developed from the real-world experience of hundreds of data management projects, Informatica Velocity
represents the collective knowledge of our consultants who have worked with organizations from around the
world to plan, develop, deploy, and maintain successful data management solutions.

If you are an Informatica Network member, you can access Informatica Velocity resources at
http://velocity.informatica.com.

If you have questions, comments, or ideas about Informatica Velocity, contact Informatica Professional
Services at ips@informatica.com.

Informatica Marketplace
The Informatica Marketplace is a forum where you can find solutions that augment, extend, or enhance your
Informatica implementations. By leveraging any of the hundreds of solutions from Informatica developers and
partners, you can improve your productivity and speed up time to implementation on your projects. You can
access Informatica Marketplace at https://marketplace.informatica.com.

Informatica Global Customer Support


You can contact a Global Support Center by telephone or through Online Support on Informatica Network.

To find your local Informatica Global Customer Support telephone number, visit the Informatica website at the
following link: http://www.informatica.com/us/services-and-training/support-services/global-support-centers.

If you are an Informatica Network member, you can use Online Support at http://network.informatica.com.

CHAPTER 1

Big Data Management Installation


This chapter includes the following topics:

• Installation Overview, 10
• Before You Begin, 11
• Big Data Management Installation from an RPM Package, 14
• Big Data Management Installation from a Debian Package, 17
• Big Data Management Installation from a Cloudera Parcel Package, 19
• Informatica Big Data Management Uninstallation, 20

Installation Overview
The Informatica Big Data Management installation package includes the Data Integration Service, the Blaze
run-time engine, and adapter components. Depending on your Hadoop implementation, Informatica
distributes the package to the Hadoop cluster as one of the following package types:

Red Hat Package Manager (RPM)


To install Big Data Management on Amazon EMR, Hortonworks, IBM BigInsights, and MapR, the tar.gz
file includes an RPM package and the binary files that you need to run the Big Data Management
installation.

Debian package
To install Big Data Management on Ubuntu Hadoop distributions on Azure HDInsight, the tar.gz file
includes a Debian package and the binary files that you need to run the Big Data Management
installation.

Cloudera Parcel package


To install Big Data Management on Hadoop distributions on Cloudera, the parcel.tar file includes a
Cloudera Parcel package and the binary files that you need to run the Big Data Management installation.

After you complete the installation, you must configure the Informatica domain and the Hadoop cluster to
enable Informatica mappings to run on a Hadoop cluster.

Informatica Big Data Management Installation Process


You can install Big Data Management in a single node or cluster environment.

Installing in a Single Node Environment
You can install Big Data Management in a single node environment.

1. Extract the Big Data Management tar.gz file to the machine.


2. Install Big Data Management by running the installation shell script in a Linux environment.

Installing in a Cluster Environment


You can install Big Data Management in a cluster environment.

1. Extract the Big Data Management tar.gz file to a machine on the cluster.
2. Install Big Data Management by running the installation shell script in a Linux environment. You can
install Big Data Management from the primary name node or from any machine using the
HadoopDataNodes file.
Add the IP address or machine host name of each node in the Hadoop cluster to the HadoopDataNodes file, one entry per line. During the Big Data Management installation, the installation shell script reads the nodes from the HadoopDataNodes file and copies the Big Data Management binary files to the /<BigDataManagementInstallationDirectory>/Informatica directory on each node.
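For example, a HadoopDataNodes file for a three-node cluster might contain entries such as the following. The host names are hypothetical; replace them with the actual host names or IP addresses of your cluster nodes:

hadoop-node01.example.com
hadoop-node02.example.com
hadoop-node03.example.com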

Before You Begin


Before you begin the installation, install the Informatica components and PowerExchange adapters, and
perform the pre-installation tasks.

Install and Configure the Informatica Domain and Clients


Before you install Big Data Management, install and configure the Informatica domain and clients.

Run the Informatica services installation to configure the Informatica domain and create the Informatica
services. Run the Informatica client installation to install the Informatica client tools.

Install and Configure PowerExchange Adapters


Based on your business needs, install and configure Informatica adapters. Use Big Data Management with
Informatica adapters for access to sources and targets.

To run Informatica mappings in a Hadoop environment, you must install and configure Informatica adapters.

You can use the following Informatica adapters as part of Big Data Management:

• PowerExchange for DataSift


• PowerExchange for Facebook
• PowerExchange for HBase
• PowerExchange for HDFS
• PowerExchange for Hive
• PowerExchange for LinkedIn



• PowerExchange for Teradata Parallel Transporter API
• PowerExchange for Twitter
• PowerExchange for Web Content-Kapow Katalyst

For more information, see the PowerExchange adapter documentation.

Install and Configure Data Replication


To migrate data with minimal downtime and perform auditing and operational reporting functions, install and
configure Data Replication. For information, see the Informatica Data Replication User Guide.

Pre-Installation Tasks for a Single Node Environment


Before you begin the Big Data Management installation in a single node environment, perform the following
pre-installation tasks.

• Verify that Hadoop is installed with Hadoop File System (HDFS) and MapReduce. The Hadoop installation
should include a Hive data warehouse that is configured to use a non-embedded database as the
MetaStore. For more information, see the Apache website here: http://hadoop.apache.org.
• To perform both read and write operations in native mode, install the required third-party client software.
For example, install the Oracle client to connect to the Oracle database.
• Verify that the Big Data Management administrator user can run sudo commands or has root user privileges.
• Verify that the temporary folder on the local node has at least 700 MB of disk space.
• Download the following file to the temporary folder: InformaticaHadoop-
<InformaticaForHadoopVersion>.tar.gz
• Extract the following file to the local node where you want to run the Big Data Management installation:
InformaticaHadoop-<InformaticaForHadoopVersion>.tar.gz

Pre-Installation Tasks for a Cluster Environment


Before you begin the Big Data Management installation in a cluster environment, perform the following tasks:

• Install third-party software.


• Verify the distribution method.
• Verify system requirements.
• Verify connection requirements.
• Download the RPM.

Install Third-Party Software


Verify that the following third-party software is installed:
Hadoop with Hadoop Distributed File System (HDFS) and MapReduce
Hadoop must be installed on every node within the cluster. The Hadoop installation must include a Hive
data warehouse that is configured to use a MySQL database as the MetaStore. You can configure Hive
to use a local or remote MetaStore server. For more information, see the Apache website here:
http://hadoop.apache.org/.

Note: Informatica does not support embedded MetaStore server setups.



Database client software to perform read and write operations in native mode
Install the client software for the database. Informatica requires the client software to run MapReduce
jobs. For example, install the Oracle client to connect to the Oracle database. Install the database client
software on all the nodes within the Hadoop cluster.

Verify the Distribution Method


You can distribute Big Data Management to the Hadoop cluster with one of the following protocols:

• File Transfer Protocol (FTP)


• Hypertext Transfer Protocol (HTTP)
• Network File System (NFS) protocol
• Secure Copy (SCP) protocol
• Cloudera Manager.

To verify that you can distribute Big Data Management to the Hadoop cluster with one of the protocols,
perform the following tasks:

Note: If you use Cloudera Manager to distribute Big Data Management to the Hadoop cluster, skip these
tasks.

1. Ensure that the server or service for your distribution method is running.
2. In the config file on the machine where you want to run the Big Data Management installation, set the
DISTRIBUTOR_NODE parameter to the following setting:
• FTP: Set DISTRIBUTOR_NODE=ftp://<Distributor Node IP Address>/pub
• HTTP: Set DISTRIBUTOR_NODE=http://<Distributor Node IP Address>
• NFS: Set DISTRIBUTOR_NODE=<Shared file location on the node.>
The file location must be accessible to all nodes in the cluster.
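For example, if the distributor node has the hypothetical IP address 10.20.30.40 and you distribute Big Data Management over FTP, the entry in the config file would be:

DISTRIBUTOR_NODE=ftp://10.20.30.40/pub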

Verify System Requirements


Verify the following system requirements:

• The Big Data Management administrator can run sudo commands or has root user privileges.
• The temporary folder in each of the nodes on which Big Data Management will be installed has at least
700 MB of disk space.
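For example, to check the free space in the temporary folder on a node, you can run a standard disk usage command. The following sketch assumes that /tmp is the temporary folder:

df -h /tmp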

Verify Connection Requirements


Verify the connection to the Hadoop cluster nodes.

Big Data Management requires a Secure Shell (SSH) connection without a password between the machine
where you want to run the Big Data Management installation and all the nodes in the Hadoop cluster.
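One common way to set up a password-less SSH connection is to generate a key pair on the machine that runs the installation and copy the public key to each cluster node. The following commands are a sketch that assumes OpenSSH and a hypothetical user and host name; adjust them for your environment:

ssh-keygen -t rsa                                # generate a key pair; accept the defaults and an empty passphrase
ssh-copy-id hadoop@hadoop-node01.example.com     # copy the public key to a cluster node; repeat for every node
ssh hadoop@hadoop-node01.example.com             # verify that no password prompt appears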



Big Data Management Installation from an RPM
Package
To install Big Data Management on Amazon EMR, Hortonworks, IBM BigInsights, and MapR, download the
tar.gz file that includes an RPM package and the binary files that you need.

You can install Big Data Management in a single node environment. You can also install Big Data
Management in a cluster environment from the primary name node or from any machine.

Install Big Data Management in a single node environment or cluster environment:

• Install Big Data Management in a single node environment.


• Install Big Data Management in a cluster environment from the primary name node using SCP protocol.
• Install Big Data Management in a cluster environment from the primary name node using FTP, HTTP, or
NFS protocol.
• Install Big Data Management in a cluster environment from any machine.
• Install Big Data Management from a shell command line.

Download the Distribution Package


1. Download the following file to a temporary folder: InformaticaHadoop-
<InformaticaForHadoopVersion>.tar.gz.
Note: The distribution package must be stored on a local disk and not on HDFS.
2. Extract the file to the machine from where you want to distribute the package and run the Big Data
Management installation.
3. Copy the InformaticaHadoop-<InformaticaForHadoopVersion>.rpm or InformaticaHadoop-
<InformaticaForHadoopVersion>.deb package to a shared directory based on the transfer protocol you
are using.
For example,
• HTTP: /var/www/html
• FTP: /var/ftp/pub
• NFS: <Shared location on the node>
The file location must be accessible by all the nodes in the cluster.
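For example, to distribute the RPM package over HTTP, you might copy it to the web server document root shown above. This is only an illustration; the actual directory depends on your web server and shared directory setup:

cp InformaticaHadoop-<InformaticaForHadoopVersion>.rpm /var/www/html/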

Installing in a Single Node Environment


You can install Big Data Management in a single node environment.

1. Log in to the machine.


2. Run the following command from the Big Data Management root directory to start the installation in
console mode:
bash InformaticaHadoopInstall.sh
3. Press y to accept the Big Data Management terms of agreement.
4. Press Enter.
5. Press 1 to install Big Data Management in a single node environment.
6. Press Enter.



7. Type the absolute path for the Big Data Management installation directory and press Enter.
Start the path with a slash. The directory names in the path must not contain spaces or the following
special characters: { } ! @ # $ % ^ & * ( ) : ; | ' ` < > , ? + [ ] \
If you type a directory path that does not exist, the installer creates the entire directory path on the node
during the installation. Default is /opt.
8. Press Enter.
The installer creates the /<BigDataManagementInstallationDirectory>/Informatica directory and
populates all of the file systems with the contents of the RPM package.
To get more information about the tasks performed by the installer, you can view the informatica-hadoop-
install.<DateTimeStamp>.log installation log file.

Installing in a Cluster Environment from the Primary Name Node Using SCP Protocol
You can install Big Data Management in a cluster environment from the primary name node using SCP.

1. Log in to the primary name node.


2. Run the following command to start the Big Data Management installation in console mode:
bash InformaticaHadoopInstall.sh
3. Press y to accept the Big Data Management terms of agreement.
4. Press Enter.
5. Press 2 to install Big Data Management in a cluster environment.
6. Press Enter.
7. Type the absolute path for the Big Data Management installation directory.
Start the path with a slash. The directory names in the path must not contain spaces or the following
special characters: { } ! @ # $ % ^ & * ( ) : ; | ' ` < > , ? + [ ] \
If you type a directory path that does not exist, the installer creates the entire directory path on each of
the nodes during the installation. Default is /opt.
8. Press Enter.
9. Press 1 to install Big Data Management from the primary name node.
10. Press Enter.
11. Type the absolute path for the Hadoop installation directory. Start the path with a slash.
12. Press Enter.
13. Type y.
14. Press Enter.
The installer retrieves a list of DataNodes from the $HADOOP_HOME/conf/slaves file. On each of the
DataNodes, the installer creates the Informatica directory and populates all of the file systems with the
contents of the RPM package. The Informatica directory is located here: /<BigDataManagementInstallationDirectory>/Informatica
You can view the informatica-hadoop-install.<DateTimeStamp>.log installation log file to get more
information about the tasks performed by the installer.



Installing Big Data Management Using Another Protocol
You can install Big Data Management in a cluster environment from the primary name node using FTP,
HTTP, or NFS protocol.

1. Log in to the primary name node.


2. Run the following command to start the Big Data Management installation in console mode:
bash InformaticaHadoopInstall.sh
3. Press y to accept the Big Data Management terms of agreement.
4. Press Enter.
5. Press 2 to install Big Data Management in a cluster environment.
6. Press Enter.
7. Type the absolute path for the Big Data Management installation directory.
Start the path with a slash. The directory names in the path must not contain spaces or the following
special characters: { } ! @ # $ % ^ & * ( ) : ; | ' ` < > , ? + [ ] \
If you type a directory path that does not exist, the installer creates the entire directory path on each of
the nodes during the installation. Default is /opt.
8. Press Enter.
9. Press 1 to install Big Data Management from the primary name node.
10. Press Enter.
11. Type the absolute path for the Hadoop installation directory. Start the path with a slash.
12. Press Enter.
13. Type n.
14. Press Enter.
15. Type y.
16. Press Enter.
The installer retrieves a list of DataNodes from the $HADOOP_HOME/conf/slaves file. On each of the
DataNodes, the installer creates the /<BigDataManagementInstallationDirectory>/Informatica
directory and populates all of the file systems with the contents of the RPM package.
You can view the informatica-hadoop-install.<DateTimeStamp>.log installation log file to get more
information about the tasks performed by the installer.

Installing in a Cluster Environment from a Non-Name Node Machine
You can install Big Data Management in a cluster environment from any machine in the cluster that is not a
name node.

1. Verify that the Big Data Management administrator has user root privileges on the node that will be
running the Big Data Management installation.
2. Log in to the machine as the root user.
3. In the HadoopDataNodes file, add the IP addresses or machine host names of the nodes in the Hadoop
cluster on which you want to install Big Data Management. The HadoopDataNodes file is located on the
node from where you want to launch the Big Data Management installation. You must add one IP address or machine host name per line in the file.



4. Run the following command to start the Big Data Management installation in console mode:
bash InformaticaHadoopInstall.sh
5. Press y to accept the Big Data Management terms of agreement.
6. Press Enter.
7. Press 2 to install Big Data Management in a cluster environment.
8. Press Enter.
9. Type the absolute path for the Big Data Management installation directory and press Enter. Start the
path with a slash. Default is /opt.
10. Press Enter.
11. Press 2 to install Big Data Management using the HadoopDataNodes file.
12. Press Enter.
The installer creates the /<BigDataManagementInstallationDirectory>/Informatica directory and
populates all of the file systems with the contents of the RPM package on the first node that appears in
the HadoopDataNodes file. The installer repeats the process for each node in the HadoopDataNodes file.

Big Data Management Installation from a Debian Package
To install Big Data Management on Ubuntu Hadoop distributions on Azure HDInsight, download the tar.gz file
that includes a Debian package and the binary files that you need.

To enable Big Data Management in an Ubuntu Hadoop cluster environment, download, decompress, and run
the product installer.

Note: The default installation location of Informatica Hadoop binaries is /opt/Informatica. This location
cannot be changed.

Download the Debian Package


1. Download the following file to a temporary folder: InformaticaHadoop-
<InformaticaForHadoopVersion>.tar.gz
2. Extract the file to the machine from where you want to distribute the Debian package and run the Big
Data Management installation.
3. Copy the following package to a shared directory based on the transfer protocol you are using:
InformaticaHadoop-<InformaticaForHadoopVersion>.deb.
For example,
• HTTP: /var/www/html
• FTP: /var/ftp/pub
• NFS: <Shared location on the node>
The file location must be accessible by all the nodes in the cluster.
Note: The Debian package must be stored on a local disk and not on HDFS.



Installing Big Data Management in a Single Node Environment
You can install Big Data Management in a single node environment.

1. Log in to the machine.


2. Run the following command from the Big Data Management root directory to start the installation in
console mode:
sudo bash InformaticaHadoopInstall.sh
3. Press y to accept the Big Data Management terms of agreement.
4. Press Enter.
5. Press 1 to install Big Data Management in a single node environment.
6. Press Enter.
To get more information about the tasks performed by the installer, you can view the informatica-hadoop-
install.<DateTimeStamp>.log installation log file.

Installing Big Data Management Using the SCP Protocol


You can install Big Data Management in a cluster environment from the primary namenode using the SCP
protocol.

1. Log in to the primary namenode.


2. Run the following command to start the Big Data Management installation in console mode:
sudo bash InformaticaHadoopInstall.sh
3. Press y to accept the Big Data Management terms of agreement.
4. Press Enter.
5. Press 2 to install Big Data Management in a cluster environment.
6. Press Enter.
7. Press 1 to install Big Data Management from the primary namenode.
8. Press Enter.
The installer installs Big Data Management in the HDInsight Hadoop cluster. The SCP utility copies the
product binaries to every node on the cluster in the following directory: /opt/Informatica.

You can view the informatica-hadoop-install.<DateTimeStamp>.log installation log file to get more
information about the tasks performed by the installer.

Installing Big Data Management Using Another Protocol


You can install Big Data Management in a cluster environment from the primary NameNode using the FTP,
HTTP, or NFS protocol.

1. Log in to the primary NameNode.


2. Run the following command to start the Big Data Management installation in console mode:
sudo bash InformaticaHadoopInstall.sh
3. Press y to accept the Big Data Management terms of agreement.
4. Press Enter.
5. Press 2 to install Big Data Management in a cluster environment.



6. Press Enter.
7. Press 1 to install Big Data Management from the primary NameNode.
8. Press Enter.
You can view the informatica-hadoop-install.<DateTimeStamp>.log installation log file to get more
information about the tasks performed by the installer.

Installing Big Data Management in a Cluster Environment


You can install Big Data Management in a cluster environment from any machine in the cluster that is not a
name node.

1. Verify that the Big Data Management administrator has user root privileges on the node that will be
running the Big Data Management installation.
2. Log in to the machine as the root user.
3. In the HadoopDataNodes file, add the IP addresses or machine host names of the nodes in the Hadoop
cluster on which you want to install Big Data Management.
You must add one IP address or machine host name per line in the file.
Note: The HadoopDataNodes file is located on the node from where you want to launch the Big Data
Management installation.
4. Run the following command to start the Big Data Management installation in console mode:
sudo bash InformaticaHadoopInstall.sh
5. Press y to accept the Big Data Management terms of agreement.
6. Press Enter.
7. Press 2 to install Big Data Management in a cluster environment.
8. Press Enter.

Big Data Management Installation from a Cloudera Parcel Package
To install Big Data Management on Cloudera Hadoop distributions, download the parcel.tar file that includes a Cloudera Parcel package and the binary files that you need to run the Big Data Management installation.

To enable Big Data Management in a Cloudera Hadoop cluster environment, download, decompress, and run
the product installer.

Note: The default installation location of Informatica Hadoop binaries is /opt/Informatica. This location
cannot be changed.

Installing Big Data Management Using Cloudera Manager


You can install Big Data Management on a Cloudera CDH cluster using Cloudera Manager.

Perform the following steps:

1. Download the following file: INFORMATICA-<version>-informatica-<version>.parcel.tar.



2. Extract manifest.json and the parcels from the .tar file.
3. Verify the location of your Local Parcel Repository.
In Cloudera Manager, click Administration > Settings > Parcels
4. Create a SHA file with the parcel name and hash listed in manifest.json that corresponds with your
Hadoop cluster.
For example, use the following parcel name for Hadoop cluster nodes that run Red Hat Enterprise Linux
6.4 64-bit:
INFORMATICA-<version>informatica-<version>-el6.parcel
Use the following hash listed for Red Hat Enterprise Linux 6.4 64-bit:
8e904e949a11c4c16eb737f02ce4e36ffc03854f
To create a SHA file, run the following command:
echo <hash> > <ParcelName>.sha
For example, run the following command:
echo "8e904e949a11c4c16eb737f02ce4e36ffc03854f" >
INFORMATICA-9.6.1-1.informatica9.6.1.1.p0.1203-el6.parcel.sha
5. Transfer the parcel and SHA file to the Local Parcel Repository with FTP.
6. Check for new parcels with Cloudera Manager.
To check for new parcels, click Hosts > Parcels.
7. Distribute the Big Data Management parcels.
8. Activate the Big Data Management parcels.

Informatica Big Data Management Uninstallation


The Big Data Management uninstallation deletes the Big Data Management binary files from all of the
DataNodes within the Hadoop cluster. Uninstall Big Data Management from the shell command line.

Uninstalling Big Data Management


Run the Big Data Management uninstaller to uninstall Big Data Management in a single node or cluster
environment.

To uninstall Big Data Management on Cloudera, see “Uninstalling Big Data Management on Cloudera” on
page 21.

1. Verify that the Big Data Management administrator can run sudo commands.
2. If you are uninstalling Big Data Management in a cluster environment, set up password-less Secure
Shell (SSH) connection between the machine where you want to run the Big Data Management
installation and all of the nodes on which Big Data Management will be uninstalled.
3. If you are uninstalling Big Data Management in a cluster environment using the HadoopDataNodes file,
verify that the HadoopDataNodes file contains the IP addresses or machine host names of each of the
nodes in the Hadoop cluster from which you want to uninstall Big Data Management. The
HadoopDataNodes file is located on the node from where you want to launch the Big Data Management
installation. Add one IP address or machine host name per line in the file.



4. Log in to the machine. The machine you log into depends on the Big Data Management environment and
uninstallation method:
• If you are uninstalling Big Data Management in a single node environment, log in to the machine on
which Big Data Management is installed.
• If you are uninstalling Big Data Management in a cluster environment using the HADOOP_HOME
environment variable, log in to the primary name node.
• If you are uninstalling Big Data Management in a cluster environment using the HadoopDataNodes file,
log in to any node.
5. Run the following command to start the Big Data Management uninstallation in console mode:
bash InformaticaHadoopInstall.sh
6. Press y to accept the Big Data Management terms of agreement.
7. Press Enter.
8. Select 3 to uninstall Big Data Management.
9. Press Enter.
10. Select the uninstallation option, depending on the Big Data Management environment:
• Select 1 to uninstall Big Data Management in a single node environment.
• Select 2 to uninstall Big Data Management in a cluster environment.
11. Press Enter.
12. If you are uninstalling Big Data Management in a cluster environment, select the uninstallation option,
depending on the uninstallation method:
• Select 1 to uninstall Big Data Management from the primary name node.
• Select 2 to uninstall Big Data Management using the HadoopDataNodes file.
13. Press Enter.
14. If you are uninstalling Big Data Management in a cluster environment from the primary name node, type
the absolute path for the Hadoop installation directory. Start the path with a slash.
The uninstaller deletes all of the Big Data Management binary files from the
/<BigDataManagementInstallationDirectory>/Informatica directory. In a cluster environment, the
uninstaller deletes the binary files from all of the nodes within the Hadoop cluster.

Uninstalling Big Data Management on Cloudera


Uninstall Big Data Management on Cloudera from the Cloudera Manager.

1. In Cloudera Manager, browse to Hosts > Parcels > Informatica.


2. Select Deactivate.
Cloudera Manager stops the Informatica Big Data Management instance.
3. Select Remove.
The cluster uninstalls Informatica Big Data Management.



CHAPTER 2

Post-Installation Tasks
This chapter includes the following topics:

• Post-Installation Overview, 22
• Reference Data Requirements, 23
• Update Hadoop Cluster Configuration Parameters, 24
• Enable Developer Tool Communication with the Hadoop Cluster, 25
• Enable Support for Lookup Transformations with Teradata Data Objects, 25
• Big Data Management Configuration Utility, 26
• Download the JDBC Driver JAR Files for Sqoop Connectivity, 33
• Add Hadoop Environment Variable Properties, 33
• Configure Run-time Engines, 34

Post-Installation Overview
After you install Big Data Management, perform the post-installation tasks to ensure that Big Data
Management runs properly.

Perform the following tasks:

1. Optionally, install the Address Validation reference data.


2. Update Hadoop cluster configuration parameters for mappings in a Hadoop environment.
3. Add Hadoop environment variable properties.
4. Enable the Developer tool to communicate with the Hadoop cluster.
5. Enable support for Lookup transformations with Teradata data objects.
6. Optionally, run the Big Data Management Configuration Utility.
7. Configure Big Data Management for Hive CLI or HiveServer2.
8. Download the JDBC driver JAR files for Sqoop connectivity.
9. Configure runtime engines.

Reference Data Requirements
If you have a Data Quality product license, you can push a mapping that contains data quality
transformations to a Hadoop cluster. Data quality transformations can use reference data to verify that data
values are accurate and correctly formatted.

When you apply a pushdown operation to a mapping that contains data quality transformations, the operation
can copy the reference data that the mapping uses. The pushdown operation copies reference table data,
content set data, and identity population data to the Hadoop cluster. After the mapping runs, the cluster
deletes the reference data that the pushdown operation copied with the mapping.

Note: The pushdown operation does not copy address validation reference data. If you push a mapping that
performs address validation, you must install the address validation reference data files on each DataNode
that runs the mapping. The cluster does not delete the address validation reference data files after the
address validation mapping runs.

Address validation mappings validate and enhance the accuracy of postal address records. You can buy
address reference data files from Informatica on a subscription basis. You can download the current address
reference data files from Informatica at any time during the subscription period.

Installing the Address Reference Data Files


To install the address reference data files on each DataNode in the cluster, create an automation script.

1. Browse to the address reference data files that you downloaded from Informatica.
You download the files in a compressed format.
2. Extract the data files.
3. Copy the files to the name node machine or to another machine that can write to the DataNodes.
4. Create an automation script to copy the files to each DataNode.

• If you copied the files to the name node, use the slaves file for the Hadoop cluster to identify the
DataNodes. If you copied the files to another machine, use the Hadoop_Nodes.txt file to identify the
DataNodes.
Find the Hadoop_Nodes.txt file in the Big Data Management installation package.
• The default directory for the address reference data files in the Hadoop environment
is /reference_data. If you install the files to a non-default directory, create the following custom
property on the Data Integration Service to identify the directory:
AV_HADOOP_DATA_LOCATION
Create the custom property on the Data Integration Service that performs the pushdown operation in
the native environment.
5. Run the automation script.
The script copies the address reference data files to the DataNodes.
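The following shell script is a minimal sketch of such an automation script. It assumes that the extracted address
reference data files are in /tmp/reference_data, that the node list file contains one DataNode host per line, and that
a password-less SSH connection is configured. Adjust the paths and file names for your environment:
#!/bin/bash
# Copy the extracted address reference data files to every DataNode.
SOURCE_DIR=/tmp/reference_data          # directory that contains the extracted files (assumption)
TARGET_DIR=/reference_data              # default address reference data directory on each DataNode
NODE_LIST=/etc/hadoop/conf/slaves       # file that lists one DataNode host per line (assumption)
while read -r node; do
    [ -z "$node" ] && continue
    ssh "$node" "mkdir -p $TARGET_DIR"
    scp -r "$SOURCE_DIR"/* "$node:$TARGET_DIR/"
done < "$NODE_LIST"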



Update Hadoop Cluster Configuration Parameters
Hadoop cluster configuration parameters that set the Java library path in the mapred-site.xml file can override
the paths set in hadoopEnv.properties. Update the mapred-site.xml cluster configuration file on all the
cluster nodes to remove Java options that set the Java library path.

The following cluster configuration parameters in mapred-site.xml can override the Java library path set in
hadoopEnv.properties:

• mapreduce.admin.map.child.java.opts
• mapreduce.admin.reduce.child.java.opts

If the Data Integration Service cannot access the native libraries set in hadoopEnv.properties, mappings
can fail to run in a Hadoop environment.

After you install, perform the following steps:

• Update the cluster configuration file mapred-site.xml to remove the Java option -Djava.library.path
from the property configuration.
• Edit hadoopEnv.properties to include the user Hadoop libraries in the Java Library path.

Example to Update mapred-site.xml on Cluster Nodes


If mapred-site.xml sets the following configuration for the mapreduce.admin.map.child.java.opts parameter:
<property>
<name>mapreduce.admin.map.child.java.opts</name>
<value>-server -XX:NewRatio=8 -Djava.library.path=/usr/lib/hadoop/lib/native/:/mylib/ -
Djava.net.preferIPv4Stack=true</value>
<final>true</final>
</property>

The path to Hadoop libraries in mapreduce.admin.map.child.java.opts overrides the following path set in
hadoopEnv.properties:
infapdo.java.opts=-Xmx512M -XX:GCTimeRatio=34 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -
XX:ParallelGCThreads=2 -XX:NewRatio=2 -Djava.library.path=$HADOOP_NODE_INFA_HOME/
services/shared/bin:$HADOOP_NODE_HADOOP_DIST/lib/*:$HADOOP_NODE_HADOOP_DIST/lib/native/
Linux-amd64-64 -Djava.security.egd=file:/dev/./urandom

To run mappings in a Hadoop environment, complete the following steps:

• Remove the -Djava.library.path Java option from the mapreduce.admin.map.child.java.opts parameter.
• Change hadoopEnv.properties to include the Hadoop libraries in the path /usr/lib/hadoop/lib/native
and /mylib/ with the following syntax:
infapdo.java.opts=-Xmx512M -XX:GCTimeRatio=34 -XX:+UseConcMarkSweepGC -XX:
+UseParNewGC -XX:ParallelGCThreads=2 -XX:NewRatio=2 -Djava.library.path=
$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_HADOOP_DIST/lib/*:
$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native/:/
mylib/ -Djava.security.egd=file:/dev/./urandom
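For reference, after you remove the -Djava.library.path Java option, the mapreduce.admin.map.child.java.opts
property from the earlier example looks similar to the following sample:
<property>
<name>mapreduce.admin.map.child.java.opts</name>
<value>-server -XX:NewRatio=8 -Djava.net.preferIPv4Stack=true</value>
<final>true</final>
</property>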



Enable Developer Tool Communication with the
Hadoop Cluster
Edit developerCore.ini to enable the Developer tool to communicate with the Hadoop cluster on a particular
Hadoop distribution. After you edit the file, run run.bat to launch the Developer tool client again.

developerCore.ini is located in the following directory: <InformaticaClientInstallationDirectory>\<version>\clients\DeveloperClient

Add the following property to developerCore.ini:

• -DINFA_HADOOP_DIST_DIR=hadoop\<Hadoop_distribution_name>_<version_number>

For example, the distribution name for a Hadoop cluster that runs MapR version 5.1 is mapr_5.1.0.

If you use the MapR distribution you must also set the MAPR_HOME environment variable to run MapR
mappings in a Hadoop environment. Perform the following additional tasks:

• Add the following properties to developerCore.ini:


- -Djava.library.path=hadoop\mapr_<version>\lib\native\Win64;bin;..\DT\bin

- -Dmapr.library.flatclass
• Edit run.bat to set the MAPR_HOME environment variable and the -clean settings.
For example, include the following lines:
MAPR_HOME=<InformaticaClientInstallationDirectory>\<version>\clients\DeveloperClient\hadoop\mapr_<version>
developerCore.exe -clean
• Copy mapr-cluster.conf to the following directory on the machine where the Developer tool runs:
<Informatica installation directory>\<version>\clients\DeveloperClient\hadoop
\mapr_<version>\conf.
You can find mapr-cluster.conf in the following directory on any node in the Hadoop cluster: <MapR
installation directory>/conf

Enable Support for Lookup Transformations with Teradata Data Objects
To use Lookup transformations with a Teradata data object in Hadoop pushdown mode, you must copy the
Teradata JDBC drivers to the Informatica installation directory.

You can download the Teradata JDBC drivers from Teradata. For more information about the drivers, see the
following Teradata website: http://downloads.teradata.com/download/connectivity/jdbc-driver.

The software available for download at the referenced links belongs to a third party or third parties, not
Informatica LLC. The download links are subject to the possibility of errors, omissions or change. Informatica
assumes no responsibility for such links and/or such software, disclaims all warranties, either express or
implied, including but not limited to, implied warranties of merchantability, fitness for a particular purpose, title
and non-infringement, and disclaims all liability relating thereto.

Copy the tdgssconfig.jar and terajdbc4.jar files from the Teradata JDBC drivers to the following
directory on the machine where the Data Integration Service runs and on every node in the Hadoop cluster:
<Informatica installation directory>/externaljdbcjars



Additionally, you must copy the tdgssconfig.jar and terajdbc4.jar files to the following directory on the
machine where the Developer tool runs: <Informatica installation directory>\clients
\externaljdbcjars.
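For example, on a Linux machine, commands similar to the following copy the driver files to the Data Integration
Service machine and to a cluster node. The source directory and the cluster host name are placeholders:
# Copy the Teradata JDBC driver files on the Data Integration Service machine.
cp tdgssconfig.jar terajdbc4.jar <Informatica installation directory>/externaljdbcjars/
# Copy the files to a node in the Hadoop cluster. Repeat for every node.
scp tdgssconfig.jar terajdbc4.jar <cluster node host name>:<Informatica installation directory>/externaljdbcjars/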

Big Data Management Configuration Utility


You can use the Big Data Management Configuration Utility to automate part of the configuration for Big Data
Management. Alternatively, you can manually configure Big Data Management.

The Big Data Management Configuration Utility assists with the following tasks:

• Creates configuration files on the machine where the Data Integration Service runs.
• Creates connections between the cluster and the Data Integration Service.
• Updates Data Integration Service properties in preparation for running mappings on the cluster.

After you run the utility, complete the configuration process for Big Data Management.

Note: The utility does not support Big Data Management for the following distributions:

• Highly available BigInsights clusters


• Amazon EMR
• Azure HDInsight
• MapR

To automate part of the configuration process, perform the following steps:

1. On the machine where the Data Integration Service runs, open the command line.
2. Go to the following directory: <Informatica installation directory>/tools/BDMUtil.
3. Run BDMConfig.sh.
4. Press Enter.
5. Choose the Hadoop distribution that you want to use to configure Big Data Management:

Option Description

1 Cloudera CDH

2 Hortonworks HDP

3 MapR

4 IBM BigInsights

Note: Select only 1 for Cloudera or 2 for Hortonworks. At this time, the utility does not support
configuration for MapR or BigInsights.
6. Based on the option you selected in step 5, see the corresponding topic to continue with the
configuration process:
• “Use Cloudera Manager” on page 27
• “Use Apache Ambari” on page 29
• “Use SSH” on page 31



Use Cloudera Manager
If you choose Cloudera Manager, perform the following steps to configure Big Data Management:

1. Select the version of Cloudera CDH to configure.


2. Choose the method to access files on the Hadoop cluster:
The following options appear:

Option Description

1 Cloudera Manager. Select this option to use the Cloudera Manager API to access files on the Hadoop cluster.

2 Secure Shell (SSH). Select this option to use SSH to access files on the Hadoop cluster. This option requires
SSH connections to the machines that host the name node, JobTracker, and Hive client. If you select this
option, Informatica recommends that you use an SSH connection without a password or have sshpass or
Expect installed.

Note: Informatica recommends the Cloudera Manager option.


3. Select whether to use HiveServer 2 to run mappings.
Note: If you select no, Big Data Management uses the default Hive driver to run mappings.
4. Verify the location of the Informatica Big Data Management installation directory on the cluster.
The default location appears, along with the following options:

Option Description

1 OK. Select this option to change the directory location.

2 Continue. Select this option to accept the default directory location.

5. Configure the connection to the Cloudera Manager.


a. Enter the Cloudera Manager host.
b. Enter the Cloudera user ID.
c. Enter the password for the user ID.
d. Enter the port for Cloudera Manager.
e. Select whether to use Tez as the execution engine type.
• 1 - No
• 2 - Yes
The Big Data Management Configuration Utility retrieves the required information from the Hadoop
cluster.
6. Select whether to exit the utility, or update the Data Integration Service and create connections.
Select from the following options:

Option Description

1 No. Select this option to exit the utility.

2 Yes. Select this option to continue.



7. Select whether to update Data Integration Service properties.
Select from the following options:

Option Description

1 No. Select this option to update Data Integration Service properties later.

2 Yes. Select this option to update Data Integration Service properties now.

8. Choose if you want to create connections to the Hadoop cluster:

Option Description

1. Hadoop Create a Hadoop connection.

2. Hive Create a Hive connection.

3. HDFS Create an HDFS connection.

4. HBase Create an HBase connection.

5. Select all Create all four types of connection.

Press the number that corresponds to your choice.


9. Supply information about the Informatica domain.
a. Enter the domain name.
b. Enter the node name.
c. Enter the domain user name.
d. Enter the domain password.
e. Enter the Data Integration Service name.
f. If the cluster is Kerberos-enabled, enter the following additional information:
• Hadoop Kerberos service principal name
• Hadoop Kerberos keytab location -- Location of the keytab on the Data Integration Service
machine.
Note: After you enter the Data Integration Service name, the utility tests the domain connection, and
then recycles the Data Integration Service.
10. In the Connection Details section, provide the connection properties.
Based on the type of connection you choose to create, the utility requires different properties. For more
information about the connection properties, see the Big Data Management User Guide.
Note: When you specify a directory path for the Blaze working directory or the Spark staging directory,
you must specify existing directories. The Big Data Management Configuration Utility does not validate the
directory paths that you specify.
The utility creates the connections that you configured.
11. The utility reports a summary of its operations, including whether connection creation succeeded, and
the location of the utility log files.
12. Complete the manual configuration steps for Big Data Management.
The utility creates the following files in the <Informatica installation directory>/tools/BDMUtil
directory:



ClusterConfig.properties
Contains details about the properties fetched from the Hadoop cluster and connection creation
commands that can be used to create connections to the Hadoop cluster.

Note: Edit the connection name, domain username and password to use the generated commands.

HiveServer2_EnvInfa.txt
Contains the list of environment variables and values that need to be copied to the HiveServer2
environment on the Hadoop cluster. This file is created only if you choose HiveServer2.

Use Apache Ambari


If you choose Ambari, perform the following steps to configure Big Data Management:

1. Select the Hadoop distribution directory to configure.


2. Choose the method to access files on the Hadoop cluster:
The following options appear:

Option Description

1 Apache Ambari. Select this option to use the Ambari REST API to access files on the Hadoop cluster.

2 Secure Shell (SSH). Select this option to use SSH to access files on the Hadoop cluster. This option requires
SSH connections to the machines that host the name node, JobTracker, and Hive client. If you select this
option, Informatica recommends that you use an SSH connection without a password or have sshpass or
Expect installed.

Note: Informatica recommends the Apache Ambari option.


3. Select whether to use HiveServer 2 to run mappings.
Note: If you select no, Big Data Management uses the default Hive Command Line Interface to run
mappings.
4. Verify the location of the Informatica Big Data Management installation directory on the cluster.
The default location appears, along with the following options:

Option Description

1 OK. Select this option to change the directory location.

2 Continue. Select this option to accept the default directory location.

5. Configure the connection to the Ambari Manager.


a. Enter the Ambari Manager host.
b. Enter the Ambari user ID.
c. Enter the password for the user ID.
d. Enter the port for Ambari Manager.
e. Select whether to use Tez as the execution engine type.
• 1 - No
• 2 - Yes
The Big Data Management Configuration Utility retrieves the required information from the Hadoop
cluster.



6. Select whether to exit the utility, or update the Data Integration Service and create connections.
Select from the following options:

Option Description

1 No. Select this option to exit the utility.

2 Yes. Select this option to continue.

7. Select whether to update Data Integration Service properties.


Select from the following options:

Option Description

1 No. Select this option to update Data Integration Service properties later.

2 Yes. Select this option to update Data Integration Service properties now.

8. Choose if you want to create connections to the Hadoop cluster:

Option Description

1. Hadoop Create a Hadoop connection.

2. Hive Create a Hive connection.

3. HDFS Create an HDFS connection.

4. HBase Create an HBase connection.

5. Select all Create all four types of connection.

Press the number that corresponds to your choice.


9. Supply information about the Informatica domain.
a. Enter the domain name.
b. Enter the node name.
c. Enter the domain user name.
d. Enter the domain password.
e. Enter the Data Integration Service name.
f. If the cluster is Kerberos-enabled, enter the following additional information:
• Hadoop Kerberos service principal name
• Hadoop Kerberos keytab location -- Location of the keytab on the Data Integration Service
machine.
Note: After you enter the Data Integration Service name, the utility tests the domain connection, and
then recycles the Data Integration Service.
10. In the Connection Details section, provide the connection properties.
Based on the type of connection you choose to create, the utility requires different properties. For more
information about the connection properties, see the Big Data Management User Guide.



Note: When you specify a directory path for the Blaze working directory or the Spark staging directory,
you must specify existing directories. The Big Data Management Configuration Utility does not validate the
directory paths that you specify.
The utility creates the connections that you configured.
11. The utility reports a summary of its operations, including whether connection creation succeeded, and
the location of the utility log files.
12. Complete the manual configuration steps for Big Data Management.
The utility creates the following files in the <Informatica installation directory>/tools/BDMUtil
directory:
ClusterConfig.properties
Contains details about the properties fetched from the Hadoop cluster and connection creation
commands that can be used to create connections to the Hadoop cluster.

Note: Edit the connection name, domain username and password to use the generated commands.

HiveServer2_EnvInfa.txt
Contains the list of environment variables and values that need to be copied to the HiveServer2
environment on the Hadoop cluster. This file is created only if you choose HiveServer2.

Use SSH
If you choose SSH, you must provide host names and Hadoop configuration file locations.

Note: Informatica recommends that you use an SSH connection without a password or have sshpass or
Expect installed. If you do not use one of these methods, you must enter the password each time the utility
downloads a file from the Hadoop cluster.

Verify the following host names: name node, JobTracker, and Hive client. Additionally, verify the locations for
the following files on the Hadoop cluster:

• hdfs-site.xml
• core-site.xml
• mapred-site.xml
• yarn-site.xml
• hive-site.xml
Perform the following steps to configure Big Data Management:

1. Enter the name node host name.


2. Enter the SSH user ID.
3. Enter the password for the SSH user ID.
If you use an SSH connection without a password, leave this field blank and press enter.
4. Enter the location for the hdfs-site.xml file on the Hadoop cluster.
5. Enter the location for the core-site.xml file on the Hadoop cluster.
The Big Data Management Configuration Utility connects to the name node and downloads the following
files: hdfs-site.xml and core-site.xml.
6. Enter the Yarn resource manager host name.
Note: Yarn resource manager was formerly known as JobTracker.
7. Enter the SSH user ID.



8. Enter the password for the SSH user ID.
If you use an SSH connection without a password, leave this field blank and press enter.
9. Enter the directory for the mapred-site.xml file on the Hadoop cluster.
10. Enter the directory for the yarn-site.xml file on the Hadoop cluster.
The utility connects to the JobTracker and downloads the following files: mapred-site.xml and yarn-
site.xml.
11. Enter the Hive client host name.
12. Enter the SSH user ID.
13. Enter the password for the SSH user ID.
If you use an SSH connection without a password, leave this field blank and press enter.
14. Enter the directory for the hive-site.xml file on the Hadoop cluster.
The utility connects to the Hive client and downloads the following file: hive-site.xml.
15. Choose if you want to create connections to the Hadoop cluster:

Option Description

1. Hadoop Create a Hadoop connection.

2. Hive Create a Hive connection.

3. HDFS Create an HDFS connection.

4. HBase Create an HBase connection.

5. Select all Create all four types of connection.

Press the number that corresponds to your choice.


16. Supply information about the Informatica domain.
a. Enter the domain name.
b. Enter the node name.
c. Enter the domain user name.
d. Enter the domain password.
e. Enter the Data Integration Service name.
f. If the cluster is Kerberos-enabled, enter the following additional information:
• Hadoop Kerberos service principal name
• Hadoop Kerberos keytab location -- Location of the keytab on the Data Integration Service
machine.
Note: After you enter the Data Integration Service name, the utility tests the domain connection, and
then recycles the Data Integration Service.
17. In the Connection Details section, provide the connection properties.
Based on the type of connection you choose to create, the utility requires different properties. For more
information about the connection properties, see the Big Data Management User Guide.
Note: When you specify a directory path for the Blaze working directory or the Spark staging directory,
you must specify existing directories. The Big Data Management Configuration Utility does not validate the
directory paths that you specify.



The utility creates the connections that you configured.
18. The utility reports a summary of its operations, including whether connection creation succeeded, and
the location of the utility log files.
19. Complete the manual configuration steps for Big Data Management.
The utility creates the following files in the <Informatica installation directory>/tools/BDMUtil
directory:
ClusterConfig.properties
Contains details about the properties fetched from the Hadoop cluster and connection creation
commands that can be used to create connections to the Hadoop cluster.

Note: Edit the connection name, domain username and password to use the generated commands.

HiveServer2_EnvInfa.txt
Contains the list of environment variables and values that need to be copied to the HiveServer2
environment on the Hadoop cluster. This file is created only if you choose HiveServer2.

Download the JDBC Driver JAR Files for Sqoop Connectivity
To configure Sqoop connectivity for relational databases, you must download the relevant JDBC driver jar
files and copy the jar files to the node where the Data Integration Service runs. At run time, the Data
Integration Service copies the jar files to the Hadoop distribution cache so that the jar files are accessible to
all nodes in the Hadoop cluster.

You can use any Type 4 JDBC driver that the database vendor recommends for Sqoop connectivity.

Note: The DataDirect JDBC drivers that Informatica ships are not licensed for Sqoop connectivity.

1. Download the JDBC driver jar files for the database that you want to connect to.
2. On the node where the Data Integration Service runs, copy the JDBC driver jar files to the following
directory:
<Informatica installation directory>\externaljdbcjars
If the Data Integration Service runs on a grid, repeat this step on all nodes in the grid.
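For example, to enable Sqoop connectivity for a MySQL database, you might copy the MySQL Connector/J driver.
The file name below is illustrative and depends on the driver version that you download:
cp mysql-connector-java-<version>-bin.jar <Informatica installation directory>/externaljdbcjars/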

Add Hadoop Environment Variable Properties


You can optionally add third-party environment variables or extend the existing PATH environment variable in
the Hadoop environment properties file, hadoopEnv.properties.

Perform this task manually if you do not use the Big Data Management configuration utility. For more
information about the utility, see “Big Data Management Configuration Utility” on page 26.

1. Go to the following location: <InformaticaInstallationDir>/services/shared/hadoop/<Hadoop_distribution_name>_<version_number>/infaConf
2. Find the file named hadoopEnv.properties.
3. Back up the file before you modify it.



4. Use a text editor to open the file and modify the properties.
5. Save the properties file with the name hadoopEnv.properties.
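For example, to extend the PATH environment variable with a third-party directory, you might add or edit an entry
similar to the following line. The directory is illustrative; the infapdo.env.entry.<name>=<VARIABLE>=<value> format
follows the infapdo.env.entry.ld_library_path example shown later in this guide:
infapdo.env.entry.path=PATH=/opt/thirdparty/bin:$PATH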

Configure Run-time Engines


You can run mappings in the Informatica native environment, or choose a run-time engine to run mappings in
the Hadoop environment.

When you choose the native run-time engine, Big Data Management uses the Data Integration Service to run
mappings on the Informatica domain. You can also choose a run-time engine to run mappings in the Hadoop
environment, which pushes mapping processing to the cluster.

When you want to run mappings on the cluster, you choose from the following run-time engines:
Blaze engine
The Blaze engine is an Informatica software component that can run mappings on the Hadoop cluster.

Spark engine
Spark is an Apache project that provides a run-time engine that can run mappings on the Hadoop
cluster.

Hive engine
When you run mappings on the Hive run-time engine, you choose Hive Command Line Interface or
HiveServer 2.

Note: Hive Command Line Interface is commonly abbreviated Hive CLI.

Blaze Engine Configuration


You can use the Blaze runtime engine to run mappings in the Hadoop environment.

The Blaze engine supports only the following distributions:

• Azure HDInsight
• Cloudera CDH
• Hortonworks HDP Hadoop
• MapR
• IBM BigInsights

Skip the tasks for the Blaze engine if you run Big Data Management on another Hadoop distribution.

Perform the following configuration tasks in the Big Data Management installation:

1. Configure Blaze on Kerberos-enabled clusters.


2. Configure Blaze engine log directories.
3. Reset system settings to allow more processes and files.
4. Open required ports for the Blaze engine.
5. Allocate cluster resources for Blaze.

Depending on the Hadoop environment, you perform additional steps in the Hadoop cluster to allow Big Data
Management to use the Blaze engine to run mappings. See Chapter 3, "Configuring Big Data Management
to Run Mappings in Hadoop Environments" on page 41.



Configure Blaze on Kerberos-Enabled Clusters
If you want to configure Blaze to run mappings on Kerberos-enabled Cloudera or Hortonworks clusters, you must
manually copy certain configuration files after the Big Data Management Configuration Utility finishes.

1. Copy the following files from the cluster name node:


• core-site.xml
• hdfs-site.xml
2. Paste the files to the following directory on the VM that runs the Data Integration Service:
<InformaticaInstallationDir>/services/shared/hadoop/<HadoopDistributionName>/conf

Configure Blaze Engine Log Directories


The hadoopEnv.properties file lists the log directories that the Blaze engine uses on the node and on
HDFS. You must grant write permission on log directories for the user account that starts the Blaze engine.

Grant the user account that starts the Blaze engine write permission on the directories specified in the following
properties:

• infagrid.node.local.root.log.dir
• infacal.hadoop.logs.directory

For more information about user accounts for the Blaze engine, see the Informatica Big Data Management
Security Guide.

Reset System Settings to Allow More Processes and Files


If you want to use Blaze to run mappings on the Hadoop cluster, increase operating system settings on the
machine that hosts the Data Integration Service to allow more user processes and files.

To get a list of the operating system settings, including the file descriptor limit, run the following command:
C Shell
limit

Bash Shell
ulimit -a

Informatica service processes can use a large number of files. Set the file descriptor limit per process to
16,000 or higher. The recommended limit is 32,000 file descriptors per process.

To change system settings, run the limit or ulimit command with the pertinent flag and value. For example, to
set the file descriptor limit, run the following command:
C Shell
limit -h descriptors <value>

Bash Shell
ulimit -n <value>
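The limit and ulimit commands change the settings only for the current shell session. On most Linux systems, you
can also make the file descriptor limit persistent by editing /etc/security/limits.conf. The following entries are a
sketch that assumes the Informatica services run under a user named infauser:
# /etc/security/limits.conf (illustrative entries)
infauser    soft    nofile    32000
infauser    hard    nofile    32000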

Open the Required Ports for the Blaze Engine


You must open a range of ports for the Blaze engine to use to communicate with the Informatica domain.

Note: Skip this task if the Blaze engine does not support the distribution that the Hadoop cluster runs.



If the Hadoop cluster is behind a firewall, work with your network administrator to open the range of ports that
the Blaze engine uses.

When you create the Hadoop connection, specify the port range that the Blaze engine can use with the
minimum port and maximum port fields.

Allocate Cluster Resources for Blaze


When you use Blaze to run mappings, verify that the cluster allocates sufficient memory and resources to
management and runtime services.

Allocate the following types of resource for each container on the cluster:
Memory
Random Access Memory (RAM) available for each container. This setting is also known as the container
size. You can set the minimum and maximum memory per container.

On each of the data nodes on the cluster:

• Set the minimum container memory to allow the VM to spawn sufficient containers.
• Set maximum memory on the cluster to increase resource memory available to Blaze services.

Vcore
A vcore is a virtual core. The number of virtual cores per container may correspond to the number of
physical cores on the cluster, but you can increase the number to allow for more processing. You can set
the minimum and maximum number of vcores per container.

The following table contains resource allocation guidelines:

Node Type Resources Required Per Container

Runtime node -- runs mappings only - Minimum memory: Set to no less than 4 GB less than the maximum memory.
- At least 10 GB maximum memory
- 6 vcores

Management node -- a single node that runs - Minimum memory: Set to no less than 4 GB less than the maximum memory.
mappings and management services - At least 13 GB maximum memory
- 9 vcores

Set the resources in the configuration console for the cluster, or edit the file yarn-site.xml.

To edit resource settings in yarn-site.xml:

1. Use yarn.nodemanager.resource.memory-mb to set the maximum memory setting.


2. Use yarn.scheduler.minimum-allocation-mb to set the minimum memory setting.
3. Use yarn.nodemanager.resource.cpu-vcores to set the number of vcores.
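For example, the following yarn-site.xml entries set illustrative values for a runtime node that follows the guidelines
in the table: 10 GB maximum container memory, a minimum of 6 GB (4 GB less than the maximum), and 6 vcores.
Choose values that match the capacity of your cluster:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>10240</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>6144</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>6</value>
</property>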

Spark Engine Configuration


If you want to use the Spark runtime engine to run mappings in the Hadoop environment, perform the
following configuration tasks in the Big Data Management installation.

1. Reset system settings to allow more processes and files.


2. Enable dynamic allocation.
3. Enable the Spark shuffle service.
The Spark engine only supports the Cloudera CDH and Hortonworks HDP Hadoop distributions. Skip the
tasks for the Spark engine if you run Big Data Management on another Hadoop distribution.



Reset System Settings to Allow More Processes and Files
If you want to use Spark to run mappings on the Hadoop cluster, increase operating system settings on the
machine that hosts the Data Integration Service to allow more user processes and files.

To get a list of the operating system settings, including the file descriptor limit, run the following command:
C Shell
limit

Bash Shell
ulimit -a

Informatica service processes can use a large number of files. Set the file descriptor limit per process to
16,000 or higher. The recommended limit is 32,000 file descriptors per process.

To change system settings, run the limit or ulimit command with the pertinent flag and value. For example, to
set the file descriptor limit, run the following command:
C Shell
limit -h descriptors <value>

Bash Shell
ulimit -n <value>

Set Up Dynamic Allocation


You can dynamically adjust the resources that an application occupies based on the workload. To enable
dynamic allocation for the Spark engine, you must set up the Spark shuffle service and configure the
hadoopEnv.properties file.

1. Locate the Spark shuffle .jar file and note the location.
• For HortonWorks implementations, the file is located in the following path: /opt/Informatica/
services/shared/hadoop/hortonworks_<version_number>/spark/lib/spark-<version_number>-
yarn-shuffle.jar
• For Cloudera implementations, the file is located in the following path: /<Informatica installation
directory>/services/shared/hadoop/cloudera_<version_number>/spark/lib/spark-
<version_number>-yarn-shuffle.jar
2. Add the Spark shuffle .jar file location to the classpath of each cluster node manager.
3. Edit the yarn-site.xml file in each cluster node manager.
The file is located in the following location:
• For HortonWorks implementations, the file is located in the following path: <Informatica
installation directory>/services/shared/hadoop/hortonworks_<version>/conf/
• For Cloudera implementations, the file is located in the following path: <Informatica installation
directory>/services/shared/hadoop/cloudera_cdh<version>/conf
a. Change the value of the yarn.nodemanager.aux-services property as follows:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>
b. Add the following property:
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>



Configure spark.dynamicAllocation.enabled and spark.shuffle.service.enabled in the Hadoop
environment properties file.

Configure Performance Properties


To improve performance of mappings that run on the Spark run-time engine, you can configure Spark
properties within the Hadoop properties file, hadoopEnv.properties.

1. Open hadoopEnv.properties and back it up.


You can find the file in the following location: <Informatica installation directory>/services/
shared/hadoop/<Hadoop_distribution_name>_<version_number>/infaConf/
2. Configure the following properties:

Property                                              Value    Description

spark.dynamicAllocation.enabled                       TRUE     Enables dynamic resource allocation. Required when
                                                               you enable the external shuffle service.

spark.shuffle.service.enabled                         TRUE     Enables the external shuffle service. Required when
                                                               you enable dynamic allocation.

spark.scheduler.maxRegisteredResourcesWaitingTime     15000    The number of milliseconds to wait for resources to
                                                               register before scheduling a task. Reduce this from
                                                               the default value of 30000 to reduce any delay before
                                                               starting the Spark job execution.

spark.scheduler.minRegisteredResourcesRatio           0.5      The minimum ratio of registered resources to acquire
                                                               before task scheduling begins. Reduce this from the
                                                               default value of 0.8 to reduce any delay before
                                                               starting the Spark job execution.

3. Locate the spark.executor.instances property and place a # character at the beginning of the line to
comment it out.
Note: If you enable dynamic allocation for the Spark engine, Informatica recommends that you comment
out this property.
After editing, the line appears as follows:
#spark.executor.instances=100
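After these changes, the relevant lines in hadoopEnv.properties look similar to the following sketch. The property
names match the table above, and the commented spark.executor.instances line is from the previous step:
spark.dynamicAllocation.enabled=TRUE
spark.shuffle.service.enabled=TRUE
spark.scheduler.maxRegisteredResourcesWaitingTime=15000
spark.scheduler.minRegisteredResourcesRatio=0.5
#spark.executor.instances=100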

Hive Engine Configuration


You can choose to run mappings using the Hive Command Line Interface (CLI) or HiveServer2.

The default tool is Hive CLI. When you choose the Hive Command Line Interface (CLI) to run mappings, no
configuration is required.

Choose Hive CLI or HiveServer2


If you use the Big Data Management Configuration Utility to configure Big Data Management, you can choose
to use Hive CLI or HiveServer2 during the configuration process. If you use the utility to configure Big Data



Management, the utility uncomments or comments an entry in the hadoopEnv.properties file based on your
choice.

Alternatively, you can edit the hadoopEnv.properties file to choose Hive CLI or HiveServer2. You can find
the hadoopEnv.properties file in the following directory: <Informatica installation directory>/
services/shared/hadoop/<Hadoop_distribution_name>/infaConf.

Configuring Hadoop Cluster Properties and the Data Integration Service to Use HiveServer2
The hadoopEnv.properties file contains two entries for the infapdo.aux.jars.path property. By default,
the entry for Hive CLI is uncommented, and the entry for HiveServer2 is commented out. Manually edit the
entries to choose a method to run mappings.

1. Assign the required permissions on the cluster to the user account specified in the Hive connection.
For example, the user account testuser1 belongs to the "Other" user group. To use this account, verify
that the "Other" user group has permissions on the Hive Warehouse Directory.
Additionally, testuser1 must have the following permissions:
• Full permission on the staging directory
• Full permission on the /tmp/hive-<username> directory
• Read and write permission on the /tmp directory
2. Edit the Hadoop environment properties file to set HiveServer2 as the tool to run mappings.
Note: Skip this step if you used the Big Data Management Utility to configure Hadoop properties for Big
Data Management.
a. Browse to the hadoopEnv.properties file in the following directory: <Informatica installation
directory>/services/shared/hadoop/hortonworks_<version_number>/infaConf.
The hadoopEnv.properties file contains two entries for the infapdo.aux.jars.path property.
By default, the Hive CLI entry is uncommented, and the entry for HiveServer2 is commented out.
b. To use HiveServer2, comment out the Hive CLI entry, and uncomment the HiveServer2 entry.
3. Use the Administrator tool in the Informatica domain to configure the Data Integration Service for
HiveServer2.
Note: Skip this step if you used the Big Data Management Utility to configure Hadoop properties for Big
Data Management.
a. Log in to the Administrator tool.
b. In the Domain Navigator, select the Data Integration Service.
c. In the Processes tab, create the following custom property:
ExecutionContextOptions.hive.executor.
d. Set the value to hiveserver2.
e. Recycle the Data Integration Service.
4. Disable SQL-based authorization for HiveServer2.
5. Optionally, enable storage-based authorization.



Assign Permissions on the Hadoop Cluster
To use HiveServer2, you must assign the required permissions to the user account specified in the Hive
connection.

For example, the user account testuser1 belongs to the "Other" user group. Verify that the "Other" user group
has permissions on the Hive Warehouse Directory.

Additionally, testuser1 must have the following permissions on the HDFS directories:

• Full permission on the staging directory


• Full permission on the /tmp/hive-<username> directory
• Read and write permission on the /tmp directory

Troubleshooting HiveServer2
Consider the following troubleshooting tips when you configure HiveServer2.

A mapping fails with the following error: java.lang.OutOfMemoryError: Java heap space
Increase the heap size that mapReduce can use with HiveServer2 to run mappings.

To configure the heap size, you must edit hadoopEnv.properties. You can find hadoopEnv.properties in
the following directory: <Informatica installation directory>/services/shared/hadoop/
hortonworks_<version>/infaConf.

Find the infapdo.java.opts property in hadoopEnv.properties.

Add the following values: -Xms<memory size>m -Xmx<memory size>m.

The following sample text shows the infapdo.java.opts property with a modified heap size:
infapdo.java.opts=-Djava.library.path=$HADOOP_NODE_INFA_HOME/services/shared/bin:
$HADOOP_NODE_HADOOP_DIST/lib/native:$HADOOP_NODE_HADOOP_DIST/lib/*:
$HADOOP_NODE_HADOOP_DIST/lib/native -Djava.security.egd=file:/dev/./urandom -Xms3150m -
Xmx6553m -XX:MaxPermSize=512m



CHAPTER 3

Configuring Big Data Management to Run Mappings in Hadoop Environments
This chapter includes the following topics:

• Mappings on Hadoop Distributions Overview, 41


• Enable Mappings in a Hadoop Environment, 42
• Configuring Big Data Management in the Amazon EMR Environment, 43
• Configuring Big Data Management in the Azure HDInsight Cloud Environment, 47
• Configuring Big Data Management in the Cloudera Environment, 73
• Configuring Big Data Management in the Hortonworks HDP Environment, 80
• Configuring Big Data Management in the IBM BigInsights Environment, 93
• Configuring Big Data Management in the MapR Environment, 97

Mappings on Hadoop Distributions Overview


After you install Big Data Management, you must enable mappings to run on a Hadoop cluster on a Hadoop
distribution.

After you enable Informatica mappings to run on a Hadoop cluster, you must configure the Big Data
Management Client files to communicate with a Hadoop cluster on a particular Hadoop distribution. You can
use the Big Data Management Configuration Utility to automatically configure some of the properties. After
you run the utility, you must complete the configuration for your Hadoop distribution.

Alternatively, you can manually configure Big Data Management without the utility.

The following table describes the Hadoop distributions and schedulers that you can use with Big Data
Management:

Hadoop Distribution Scheduler

Cloudera CDH 5.5 Fair Scheduler

Hortonworks HDP 2.3 CapacityScheduler


IBM BigInsights 4.1 CapacityScheduler and Fair Scheduler

MapR 5.1 CapacityScheduler and Fair Scheduler

Enable Mappings in a Hadoop Environment


To enable mappings in a Hadoop environment, perform the following configuration tasks.

• Configure Hive variables.


• Update Hadoop cluster configuration parameters.
• Configure library path and path variables.

You might also have to perform additional steps, depending on your Hadoop environment.

Configure Hive Variables for Mappings in a Hadoop Environment


To run mappings in a Hadoop environment, configure Hive environment variables.

You can configure Hive environment variables in the file /<BigDataManagementInstallationDirectory>/Informatica/services/shared/hadoop/<Hadoop_distribution_name>/conf/hive-site.xml.
Configure the following Hive environment variables:

Disable predicate pushdown


You cannot use predicate pushdown optimization for a Hive query that uses multiple insert statements.
Disable predicate pushdown optimization to get accurate results for mappings with Hive version 0.9.0.

The default Hadoop RPM installation sets hive.optimize.ppd to FALSE. Retain this value.

Dynamic Partition-related Variables


If you want to use Hive dynamic partitioned tables, configure the following variables:
hive.exec.dynamic.partition
Set this property to TRUE to enable dynamic partitioned tables.

hive.exec.dynamic.partition.mode
Set this property to nonstrict. This allows all partitions to be dynamic.
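For example, the following hive-site.xml entries enable dynamic partitioned tables and allow all partitions to be
dynamic:
<property>
<name>hive.exec.dynamic.partition</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>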

Configure Environment Variables in the Big Data Management Hadoop Environment Properties File
To add environment variables or to extend existing ones, use the Hadoop environment properties file,
hadoopEnv.properties.

You can optionally add third-party environment variables or extend the existing PATH environment variable in
hadoopEnv.properties.

1. Go to the following location: <Informatica installation directory>/services/shared/hadoop/hortonworks_2.3/infaConf



2. Find the file hadoopEnv.properties.
3. Back up the file before you modify it.
4. Use a text editor to open the file and modify the properties.
5. Save the properties file with the name hadoopEnv.properties.

Configuring Library Path and Path Variables for Mappings in a Hadoop Environment
To run mappings in a Hadoop environment, configure the library path and path environment variables in the
hadoopEnv.properties file.

You configure some environment variables for all Hadoop distributions. Other environment variables that you
configure depend on the Hadoop distribution.

Configure the following library path and path environment variables for all Hadoop distributions:

• When you run mappings in a Hadoop environment, configure the ODBC library path before the Teradata
library path. For example, infapdo.env.entry.ld_library_path=LD_LIBRARY_PATH=
$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_INFA_HOME/ODBC7.0/lib/:/opt/
teradata/client/13.10/tbuild/lib64:/opt/teradata/client/13.10/odbc_64/lib:/databases/
oracle11.2.0_64BIT/lib:/databases/db2v9.5_64BIT/lib64/:$HADOOP_NODE_INFA_HOME/
DataTransformation/bin:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-
amd64-64:$LD_LIBRARY_PATH .

Configuring Big Data Management in the Amazon EMR Environment
You can enable Informatica mappings to run on an Amazon EMR cloud cluster.

Before you can configure Big Data Management 10.1 to enable mappings to run on an Amazon EMR cluster,
you must download and install EBF 17557 on top of Big Data Management 10.1.

After you install the EBF, you complete the following configuration tasks:

Complete the following tasks in the Informatica domain:

1. Configure yarn-site.xml for the Data Integration Service.


2. Configure the Hadoop cluster for Hive tables on Amazon S3.
3. Configure the Hadoop pushdown properties for the Data Integration Service.
4. Edit Informatica Developer files and variables.
5. Open ports on the Hadoop cluster.

Complete the following tasks in the Amazon EMR cluster:

1. Configure the Hadoop cluster for the Blaze engine.


2. Start the Hadoop application timeline server.
3. Configure and start the Blaze engine console.
4. Verify Data Integration Service user permissions.
5. Create a Blaze directory and grant user permissions.



Install the EBF
Before you can configure Big Data Management 10.1 to enable mappings to run on an Amazon EMR cluster,
you must download and install EBF 17588 on top of Big Data Management 10.1.

Domain Configuration Tasks


To update the Informatica domain to enable mappings to run on Amazon EMR, perform the following tasks:

1. Configure yarn-site.xml for the Data Integration Service.


2. Configure the Hadoop cluster for Hive tables on Amazon S3.
3. Configure the Hadoop pushdown properties for the Data Integration Service.
4. Edit Informatica Developer files and variables.
5. Open ports on the Hadoop cluster.

Configure yarn-site.xml for the Data Integration Service


Configure the Amazon EMR cluster properties in the yarn-site.xml file that the Data Integration Service uses
when it runs mappings in a Hadoop cluster.

1. Make a note of the master host node name from the cluster at the following location:
/etc/hadoop/conf/yarn-site.xml
2. Open the following file for editing:
<Informatica_installation_directory>/conf/yarn-site.xml
3. Replace all instances of HOSTNAME with the master host node name.
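For example, on a Linux machine you can use sed to replace the host name. The master host node name in the
command is illustrative:
sed -i 's/HOSTNAME/ip-10-20-30-40.ec2.internal/g' <Informatica_installation_directory>/conf/yarn-site.xml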

Configure the Hadoop Cluster for Hive Tables on Amazon S3


You must configure properties in the hive-site.xml file to use the Hive engine or the Blaze engine to run
mappings when you have Hive tables on Amazon S3.

Configure the following properties:


Access Key ID
ID to use to connect to the Amazon S3 file system.

Secret Access Key
Password to connect to the Amazon S3 file system.

The following sample text shows the properties you configure in the hive-site.xml file:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value><your-s3-access-key-id></value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value><your-s3-accesskey></value>
</property>

Configure the Hadoop Pushdown Properties for the Data Integration Service
Configure Hadoop pushdown properties for the Data Integration Service to run mappings in a Hadoop
environment.

You can configure Hadoop pushdown properties for the Data Integration Service in the Administrator tool.



The following table describes the Hadoop pushdown properties for the Data Integration Service:

Informatica Home Directory on Hadoop
The Big Data Management home directory on every data node created by the Hadoop RPM install. Type /opt/Informatica.

Hadoop Distribution Directory
The directory that contains a collection of Hive and Hadoop JAR files on the cluster from the RPM install locations.
The directory contains the minimum set of JAR files required to process Informatica mappings in a Hadoop
environment. Type /opt/Informatica/services/shared/hadoop/amazon_emr<version_number>.

Data Integration Service Hadoop Distribution Directory
The Hadoop distribution directory on the Data Integration Service node. Type ../../services/shared/hadoop/
amazon_emr<version_number>. The contents of the Data Integration Service Hadoop distribution directory must be
identical to the Hadoop distribution directory on the data nodes.

Edit Informatica Developer Files and Variables


Edit developerCore.ini to enable the Developer tool to communicate with the Hadoop cluster on a particular
Hadoop distribution. After you edit the file, run run.bat to launch the Developer tool again.

developerCore.ini is located in the following directory: <Informatica installation directory>\clients\DeveloperClient

Add the following property to developerCore.ini:

-DINFA_HADOOP_DIST_DIR=hadoop\amazon_emr_<version_number>

Open Ports on the Hadoop Cluster


You must open a range of ports to enable the Informatica domain to communicate with the Hadoop cluster.

Open the following ports:

• 8020
• 8032
• 8080
• 9083
• 9080 -- for the Blaze monitoring console
• 12300 to 12600 -- for the Blaze engine.

Optionally, you can also open the following ports for debugging: 8088, 19888, and 50070.
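If the cluster nodes use firewalld, commands similar to the following open the Blaze engine port range and the
Blaze monitoring console port. This is an illustrative sketch; use the firewall tooling that your environment provides:
sudo firewall-cmd --permanent --add-port=12300-12600/tcp
sudo firewall-cmd --permanent --add-port=9080/tcp
sudo firewall-cmd --reload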

Cluster Configuration Tasks


To update the Hadoop cluster to enable mappings to run on Amazon EMR, perform the following tasks:

1. Configure the Hadoop cluster for the Blaze engine.


2. Start the Hadoop application timeline server.
3. Configure and start the Blaze engine console.
4. Verify Data Integration Service user permissions.
5. Create a Blaze directory and grant user permissions.



Configure yarn-site.xml for the Application Timeline Server
You must configure properties in the yarn-site.xml file on every node in the Hadoop cluster to enable the
Application Timeline Server.

Configure the following properties:


yarn.timeline-service.webapp.address
The HTTP address for the Application Timeline service web application.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.enabled
Indicates whether the Timeline service is enabled.

Set this value to true.

yarn.timeline-service.address
Address for the Application Timeline Server to start the RPC server.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.hostname
The host name for the Application Timeline Service web application.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.ttl-ms
The time-to-live in milliseconds for data in the timeline store.

Use 3600000.

yarn.nodemanager.local-dirs
List of directories to store localized files in.

The Blaze engine uses local directories for a distributed cache.

The following sample text shows the properties you configure in the yarn-site.xml file:
<property>
<name>yarn.timeline-service.webapp.address</name>
<value><ATSHostname>:8188</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.address</name>
<value><ATSHostname>:10200</value>
</property>
<property>
<name>yarn.timeline-service.hostname</name>
<value><ATSHostname></value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>3600000</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/<local_directory>,/<local_directory></value>
</property>



Start the Hadoop Application Timeline Server
The Blaze engine uses the Hadoop Application Timeline Server to store the Job monitor status.

To start the Hadoop Application Timeline Server, run the following command on any node in the Hadoop
cluster:
sudo yarn timelineserver &
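To verify that the Application Timeline Server started, you can query its web service endpoint. The following
command is a sketch that assumes curl is available and that the webapp address is configured on port 8188, as
shown in the yarn-site.xml sample above:

curl http://<ATSHostname>:8188/ws/v1/timeline

If the server is running, the command returns a JSON response.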

Enable the Blaze Engine Console


Enable the Blaze engine console in the hadoopEnv.properties file.

1. On the machine where the Data Integration Service runs, edit the hadoopEnv.properties file.
You can find hadoopEnv.properties in the following directory: <Informatica installation
directory>/services/shared/hadoop/<hadoop_distribution><version_number>/infaConf.
2. Set the infagrid.blaze.console.enabled property to true.
3. Save and close the hadoopEnv.properties file.

Verify Data Integration Service User Permissions


To run mappings using the Hive engine, verify that the Data Integration Service user has permissions for the
Hive warehouse directory.

For example, if the warehouse directory is /user/hive/warehouse, run the following command to grant the
user permissions for the directory:
hadoop fs -chmod -R 777 /user/hive/warehouse
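To confirm that the permissions were applied, you can list the parent directory and check the mode of the
warehouse directory. For example:

hadoop fs -ls /user/hive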

Create Blaze Directory and Grant User Permissions


To run mappings using the Blaze engine, create a directory and set permissions for it.

1. To create a Blaze directory on HDFS, run the following command:


hadoop fs -mkdir -p /blaze/workdir
2. To specify permissions on the directory, run the following command:
hadoop fs -chmod -R 777 /blaze

Configuring Big Data Management in the Azure HDInsight Cloud Environment
You can enable Informatica mappings to run in the Azure HDInsight cloud environment.

Informatica supports HDInsight clusters that are deployed on Microsoft Azure.

To enable Informatica mappings to run on an HDInsight cluster, complete the following steps:

1. Verify prerequisites.
2. Perform post-installation tasks.
3. Populate the HDFS File System.
4. Update cluster configuration settings.

5. Configure the Hadoop pushdown properties for the Data Integration Service.
6. Edit Informatica Developer files and variables.
7. Configure environment variables in the Hadoop environment.
8. Configure Hadoop cluster properties in hive-site.xml and yarn-site.xml.
9. Configure Hadoop cluster properties on the Data Integration Service machine.
10. Configure Big Data Management on the HDInsight cluster.
11. Configure mappings to run on the Hadoop cluster.
12. Configure and start Informatica services.
13. Configure settings on the Informatica domain.
14. Configure connections.

Prerequisites to Configure Big Data Management for Azure HDInsight
Before you launch the process to configure Big Data Management in the Azure cloud environment, check that
you have fulfilled the following prerequisites.

• You have an instance of HDInsight in a supported Linux cluster up and running on the Azure environment.
Refer to the Product Availability Matrix on the Informatica Network for all platform compatibility details.
• You have permission to access and administer the HDInsight instance, and to get the names and
addresses of cluster resources and other information from cluster configuration pages.
• If HBase is not already installed, install it.

Perform Post-Installation Tasks


Perform the following tasks in chapter 2, "Post-Installation Tasks," before you proceed.

1. Configure settings on the Informatica domain.


2. Configure the Blaze engine log directories.

Populate the HDFS File System


After you install Big Data Management, populate the HDFS file system.

Informatica supports read/write from the local HDFS location, but not the wasb location. In an HDInsight
cluster, the default environment has a local HDFS location that is empty, and a wasb location populated with
files. Perform the following steps to copy files from the wasb location to the local HDFS location:

1. Use the Ambari configuration tool to identify the wasb location and the HDFS location.
You can find these locations as follows:
wasb location
The wasb location is a resource locator like:
wasb://<cluster_name>@<domain_or_IP-address>/
HDFS location
The HDFS location is a resource locator like:
hdfs://<headnode_IP_address>:<port_number>/

Note: Make a note of these locations. You might need them later, during the configuration process, to
define property values.
2. Copy the folder /hdp/apps/<CurrentVersion>/ and all its contents from the wasb location to the HDFS
location, with the same folder structure. You can copy the files as shown in the example after these steps.
The HDFS file system is populated.
3. Set the value of the fs.defaultFS property to the HDFS location.
After you restart the affected components, the cluster populates files in the HDFS location.
Optionally, you can change the value of fs.defaultFS to the wasb location, and restart the affected
components again.
Note: If it is undesirable to restart cluster components, you can manually copy files from the wasb
location to the local HDFS location.
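For step 2, one way to copy the folder and preserve the folder structure is the hadoop distcp command. The
following command is a sketch; replace the wasb and HDFS URIs with the locations that you noted in step 1:

hadoop distcp wasb://<cluster_name>@<domain_or_IP-address>/hdp/apps/<CurrentVersion>/ hdfs://<headnode_IP_address>:<port_number>/hdp/apps/<CurrentVersion>/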

Configure the Hadoop Pushdown Properties for the Data Integration Service
Configure Hadoop pushdown properties for the Data Integration Service to run mappings in a Hadoop
environment.

You can configure Hadoop pushdown properties for the Data Integration Service in the Administrator tool.

The following table describes the Hadoop pushdown properties for the Data Integration Service:

Property Description

Informatica Home Directory on Hadoop The Big Data Management home directory on every data node created by the
Hadoop Debian install. Type /opt/Informatica.

Hadoop Distribution Directory The directory containing a collection of Hive and Hadoop JARS on the cluster from
the Debian install locations. The directory contains the minimum set of JARS required to process Informatica
mappings in a Hadoop environment. Type /opt/Informatica/services/shared/hadoop/hortonworks_2.3.

Data Integration Service Hadoop Distribution Directory The Hadoop distribution directory on the Data Integration
Service node. Type ../../services/shared/hadoop/hortonworks_2.3. The contents of the Data Integration Service
Hadoop distribution directory must be identical to the Hadoop distribution directory on the data nodes.

Configuring the Hadoop Distribution Directory


You can modify the Hadoop distribution directory on the data nodes.

When you modify the Hadoop distribution directory, you must copy the minimum set of Hive and Hadoop
JARS, and the Snappy libraries required to process Informatica mappings in a Hadoop environment from
your Hadoop install location. The actual Hive and Hadoop JARS can vary depending on the Hadoop
distribution and version.

The Hadoop Debian distribution installs the Hadoop distribution directories in the following path:
<BigDataManagementInstallationDirectory>/Informatica/services/shared/hadoop.



Edit Informatica Developer Files and Variables
Edit developerCore.ini to enable the Developer tool to communicate with the Hadoop cluster on a particular
Hadoop distribution. After you edit the file, you must run run.bat to launch the Developer tool again.

developerCore.ini is located in the following directory: <Informatica installation directory>\clients


\DeveloperClient

Add the following property to developerCore.ini


-DINFA_HADOOP_DIST_DIR=hadoop\hortonworks_2.3

Configure Environment Variables in the Big Data Management Hadoop Environment Properties File
To add environment variables or to extend existing ones, use the Hadoop environment properties file,
hadoopEnv.properties.

You can optionally add third-party environment variables or extend the existing PATH environment variable in
hadoopEnv.properties, as shown in the example after these steps.

1. Go to the following location: <Informatica installation directory>/services/shared/hadoop/


hortonworks_2.3/infaConf
2. Find the file hadoopEnv.properties.
3. Back up the file before you modify it.
4. Use a text editor to open the file and modify the properties.
5. Save the properties file with the name hadoopEnv.properties.
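For example, to extend the PATH environment variable in the Hadoop environment, you might add an entry that
follows the infapdo.env.entry format used elsewhere in hadoopEnv.properties. The third-party path in this sketch
is hypothetical; replace it with the location that your mappings require:

infapdo.env.entry.path=PATH=/opt/thirdparty/bin:$PATH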

Configure Hadoop Cluster Properties on the Data Integration Service
Configure Hadoop cluster properties in files that the Data Integration Service uses when it runs mappings on
an HDInsight cluster.

Configure hive-site.xml for the Data Integration Service


Configure the Hortonworks cluster properties in the hive-site.xml file that the Data Integration Service uses
when it runs mappings in a Hadoop cluster.

1. Open the hive-site.xml file in the following directory on the node on which the Data Integration Service
runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_2.3/conf/
To run a mapping in HiveServer2, configure the following properties in the hive-site.xml file:
hive.metastore.uris
URI for the metastore host.
For example:
<property>
<name>hive.metastore.uris</name>
<value>thrift://<HOSTNAME>:9083</value>
</property>
yarn.app.mapreduce.am.staging-dir
The directory where submitted jobs that use MapReduce are staged.

For example:
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value><staging_directory_path></value>
</property>
2. Use the Ambari cluster management configuration tool to get the values of the following properties from
the cluster, and add the properties to hive-site.xml, as shown in the example after this list:
• <name>fs.azure.account.key.ilabsstoragevpn.blob.core.windows.net</name>
• <name>fs.azure.account.keyprovider.ilabsstoragevpn.blob.core.windows.net</name>
• <name>fs.azure.shellkeyprovider.script</name>
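The following sketch shows the format of these properties after you add them to hive-site.xml. The values are
placeholders; use the values that you copied from Ambari:
<property>
<name>fs.azure.account.key.ilabsstoragevpn.blob.core.windows.net</name>
<value><storage_account_key></value>
</property>
<property>
<name>fs.azure.account.keyprovider.ilabsstoragevpn.blob.core.windows.net</name>
<value><key_provider_class></value>
</property>
<property>
<name>fs.azure.shellkeyprovider.script</name>
<value><path_to_key_provider_script></value>
</property>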

Configure yarn-site.xml for the Data Integration Service


Configure the Hortonworks cluster properties in the yarn-site.xml file that the Data Integration Service uses
when it runs mappings in a Hadoop cluster. You can use the Big Data Management Configuration Utility to
configure the yarn-site.xml file.

Open the yarn-site.xml file in the following directory on the node on which the Data Integration Service
runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf/

Configure the following properties in the yarn-site.xml file:


mapreduce.jobhistory.address
Location of the MapReduce JobHistory Server. The default port is 10020.

Get the value from the following file: /etc/hadoop/<version_number>/0/mapred-site.xml

For example:
<property>
<name>mapreduce.jobhistory.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server IPC host:port</description>
</property>

mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server. The default port is 19888.

Get the value from the following file: /etc/hadoop/<version_number>/0/mapred-site.xml

For example:
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server Web UI host:port</description>
</property>

yarn.resourcemanager.scheduler.address
Scheduler interface address.

Use the value in the following file: /etc/hadoop/conf/yarn-site.xml

yarn.resourcemanager.webapp.address
Web application address for the Resource Manager.

Use the value in the following file: /etc/hadoop/conf/yarn-site.xml.

The following sample text shows the properties you can set in the yarn-site.xml file:
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hostname:port</value>
<description>The address of the scheduler interface</description>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hostname:port</value>
<description>The address for the Resource Manager web application.</description>
</property>

Configure mapred-site.xml for the Data Integration Service


Configure the Hortonworks cluster properties in the mapred-site.xml file that the Data Integration Service
uses when it runs mappings in a Hadoop cluster.

Open the mapred-site.xml file in the following directory on the node on which the Data Integration Service
runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf/

Configure the following properties in the mapred-site.xml file:


mapreduce.jobhistory.intermediate-done-dir
Directory where the MapReduce jobs write history files.

Use the value in the following file: /etc/hadoop/conf/mapred-site.xml

mapreduce.jobhistory.done-dir
Directory where the MapReduce JobHistory server manages history files.

Use the value in the following file: /etc/hadoop/conf/mapred-site.xml

The following sample text shows the properties you must set in the mapred-site.xml file:
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/mr-history/tmp</value>
<description>Directory where MapReduce jobs write history files.</description>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/mr-history/done</value>
<description>Directory where the MapReduce JobHistory server manages history
files.</description>
</property>

Configure the following properties in the mapred-site.xml file:


mapreduce.jobhistory.address
Location of the MapReduce JobHistory Server.

Use the value in the following file:/etc/hadoop/conf/mapred-site.xml

mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server.

Use the value in the following file: /etc/hadoop/conf/mapred-site.xml

The following sample text shows the properties you can set in the mapred-site.xml file:
<property>
<name>mapreduce.jobhistory.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server IPC host:port</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server Web UI host:port</description>
</property>

Configure Rolling Upgrades for HDInsight


To enable support for rolling upgrades for HDInsight, you must configure the following properties in mapred-
site.xml on the machine where the Data Integration Service runs:
mapreduce.application.classpath
Classpaths for MapReduce applications.

Use the following value:


$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/
hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-
framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/
yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/
share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-
framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-
lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure

Replace ${hdp.version} with the version number of the Hortonworks HDInsight cluster.

mapreduce.application.framework.path
Path for the MapReduce framework archive.

Use the following value:


/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework

Replace ${hdp.version} with the version of the Hortonworks HDInsight cluster.

The following sample text shows the properties you can set in the mapred-site.xml file:
<property>
<name>mapreduce.application.classpath</name>
<value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/
hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-
framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/
yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/
share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-
framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-
lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure</value>
<description>Classpaths for MapReduce applications.</description>
</property>
<property>
<name>mapreduce.application.framework.path</name>
<value>/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework</value>
<description>Path for the MapReduce framework archive.</description>
</property>

Configure Big Data Management on the HDInsight Cluster


Use the Azure marketplace web site to configure Big Data Management on the HDInsight cluster. Big Data
Management uses an SQL Server 2014 database to operate with HDInsight.

1. Select Big Data Management for setup.


a. In the Azure marketplace, click the Add (+) button to create a new resource.
b. Search on "Informatica" to find Informatica offerings in the Azure marketplace.
c. Select Big Data Management Enterprise Edition <version_number> BYOL.
The "Create Big Data Management Enterprise Edition" tab opens. It displays all the steps necessary
to configure and launch Big Data Management on the cluster.
2. Supply information in the Basics panel, and then click OK.

Subscription
Select the Azure subscription account that you want to use for Big Data Management.

Charges for this instance of Big Data Management will go to this subscription.

Location
Location of the resource group.

Select a resource group location.


3. Supply information in the Node Settings panel, and then click OK.
This tab allows you to configure details of the Informatica domain. Azure deploys the domain on an
Ubuntu Linux box.
Number of nodes in the domain.
Default is 2.

Machine prefix
Type an alphanumeric string that will be a prefix on the name of each virtual machine in the
Informatica domain.

For example, if you use the prefix "infa" then Azure will identify virtual machines in the domain with
this string at the beginning of the name.

Username
Username that you use to log in to the virtual machine that hosts the Informatica domain.

Authentication type
Authentication protocol you use to communicate with the Informatica domain.

Default is SSH Public Key.

Password
Password to use to log in to the virtual machine that hosts the Informatica domain.

Machine size
Select from among the available preconfigured VMs.
4. Supply information in the Domain Settings panel, and then click OK.
This tab allows you to configure additional details of the Informatica domain.
Informatica Domain Name
Type a name for the Informatica domain. This becomes the name of the Informatica domain on the
cluster.

Informatica domain administrator name


Login to use to administer the Informatica domain.

Password
Password for the Informatica administrator.

Keyphrase for encryption key


Create a keyphrase to create an encryption key.
5. Supply information in the Database Settings panel, and then click OK.
This tab allows you to configure settings for the storage where Informatica metadata will be stored.

Database type
Select SQL Server 2014.

Database machine name


Name for the virtual machine that hosts the domain database.

Database machine size


Select a size from among the available preconfigured virtual machines.

Username
Username for the administrator of the virtual machine host of the database.

Use these credentials to log in to the virtual machine where the database is hosted.

Password
Password for the database machine administrator.

Informatica Domain DB User


Name of the database user of the Model repository database.

The Informatica domain uses this account to communicate with the Model repository database.

Informatica Domain DB Password


Password for the database user.
6. Supply information in the Informatica Big Data Management Configuration panel, and then click OK.
This tab allows you to configure credentials that allow the Informatica domain to communicate with the
HDInsight cluster. Get the information for these settings from HDInsight cluster settings panels and the
Ambari cluster management tool.
HDInsight Cluster Hostname
Name of the HDInsight cluster where you want to create the Informatica domain.

HDInsight Cluster Login Username


User login for the cluster. This is typically the same login you use to log in to the Ambari cluster
management tool.

Password
Password for the HDInsight cluster user.

HDInsight Cluster SSH Hostname


Name of the cluster SSH host.

HDInsight Cluster SSH Username


Account name you use to log in to the cluster head node.

Password
Password to access the cluster SSH host.

The panel requires you to enter values for the following additional addresses. Get these addresses from
the Ambari cluster management tool:
• mapreduce.jobhistory.address
• mapreduce.jobhistory.webapp.address
• yarn.resourcemanager.scheduler.address
• yarn.resourcemanager.webapp.address

7. Supply information in the Infrastructure Settings panel, and then click OK.
Use this tab to set up cluster resources for the Big Data Management implementation.
Storage account
Storage resource that the virtual machines that run the Big Data Management implementation will
use for data storage.

Select an existing storage resource, or create a new one.

When you select an existing storage resource, verify that it belongs to the resource group you want.
It is not essential to select the same resource group as the group that the Big Data Management
implementation belongs to.

Virtual network
Virtual network for the Big Data Management implementation to belong to. Select the same network
as the one that you used to create the HDInsight cluster.

Subnets
The subnet that the virtual network contains.

Accept the default subnet.


8. Verify the choices in the Summary panel, and then click OK.
9. Read the terms of use in the Buy panel, and then click Create.
When you click Create, Azure deploys Big Data Management and creates resources in the environment
that you configured.

Configure Mappings to Run on the Hadoop Cluster


You can enable Informatica mappings to run on a Hadoop cluster on Hortonworks HDInsight.

Informatica supports HDInsight clusters that are deployed on Microsoft Azure.

Note: If you do not use HiveServer2 to run mappings, skip the steps related to HiveServer2.

To enable Informatica mappings to run on a Hortonworks HDInsight cluster, complete the following steps:

1. Enable the Data Integration Service to use Hive CLI to run mappings.
2. Configure the mapping logic pushdown method.
3. Enable HBase support.
4. Create the HiveServer2 environment variables and configure the HiveServer2 environment.
5. Configure the Hadoop cluster for the Blaze engine.
6. Disable SQL standard based authorization to run mappings with HiveServer2.
7. Enable storage based authorization with HiveServer2.
8. Enable support for HBase with HiveServer2.

Enable the Data Integration Service to Use Hive CLI to Run Mappings
Perform the following tasks to enable the Data Integration Service to use Hive CLI to run mappings:

1. Copy the following files from the Hadoop cluster to the following location on the machine that hosts the
Data Integration Service: <Informatica_installation_directory>/hortonworks_2.3/lib
• /usr/hdp/<CurrentVersion>/hadoop/hadoop-azure-2.7.1.2.3.3.1-7.jar
• /usr/hdp/<CurrentVersion>/hadoop/lib/azure-storage-2.2.0.jar
• /usr/hdp/<CurrentVersion>/hadoop/lib/jetty-util-6.1.26.hwx.jar
2. Get the values of the following properties from the Ambari management browser:
• <name>fs.azure.account.key.ilabsstoragevpn.blob.core.windows.net</name>
• <name>fs.azure.account.keyprovider.ilabsstoragevpn.blob.core.windows.net</name>
• <name>fs.azure.shellkeyprovider.script</name>
3. Open the following file for editing:
<BigDataManagementInstallationDirectory>/Informatica/services/shared/hadoop/
hortonworks_2.3/conf/hive-site.xml
Add the properties from step 2 to the hive-site.xml file, then save and close the file.

Configure the Mapping Logic Pushdown Method


You can use MapReduce or Tez to push mapping logic to the Hadoop cluster. You enable MapReduce or Tez
for the Data Integration Service or for a connection.

When you enable MapReduce or Tez for the Data Integration Service, that execution engine becomes the
default execution engine to push mapping logic to the Hadoop cluster. When you enable MapReduce or Tez
for a connection, that engine takes precedence over the execution engine set for the Data Integration
Service.

Choose MapReduce or Tez as the Execution Engine for the Data Integration Service
To use MapReduce or Tez as the default execution engine to push mapping logic to the Hadoop cluster,
perform the following steps:

1. Open hive-site.xml in the following directory on the node on which the Data Integration Service runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/
conf/
2. Edit the hive.execution.engine property.
The following sample text shows the property in hive-site.xml:
<property>
<name>hive.execution.engine</name>
<value>tez</value>
<description>Chooses execution engine. Options are: mr (MapReduce, default) or tez
(Hadoop 2 only)</description>
</property>
Set the value of the property as follows:
• mr -- Sets MapReduce as the execution engine.
• tez -- Sets Tez as the execution engine.

Enable Tez for a Hadoop or Hive Connection


When you enable Tez for a connection, the Data Integration Service uses Tez to push mapping logic to the
Hadoop cluster regardless of what is set for the Data Integration Service.

1. Open the Developer tool.


2. Click Window > Preferences.
3. Select Informatica > Connections.
4. Expand the domain.
5. Expand the Databases and select the Hadoop or Hive connection.
6. Edit the connection and configure the Environment SQL property on the Database Connection tab.
Use the following value: set hive.execution.engine=tez;

If you enable Tez for the Data Integration Service but want to use MapReduce, you can use the following
value for the Environment SQL property: set hive.execution.engine=mr;.

Configure Tez
If you use Tez as the execution engine, you must configure properties in tez-site.xml.

You can find tez-site.xml in the following directory on the machine where the Data Integration Service
runs: <Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf.

Configure the following properties:


tez.lib.uris
Specifies the location of tez.tar.gz on the Hadoop cluster.

Use the value specified in tez-site.xml on the cluster. You can find tez-site.xml in the following
directory on any node in the cluster: /etc/tez/conf.

tez.am.launch.env
Specifies the location of Hadoop libraries.

Use the following syntax when you configure tez-site.xml:


<property>
<name>tez.lib.uris</name>
<value><file system default name>://<directory of tez.tar.gz></value>
<description>The location of tez.tar.gz. Set tez.lib.uris to point to the tar.gz
uploaded to HDFS.</description>
</property>
<property>
<name>tez.am.launch.env</name>
<value>LD_LIBRARY_PATH=<HDInsight directory>/<HDInsight version>/hadoop/lib/native</
value>
<description>The location of Hadoop libraries.</description>
</property>

The following example shows the properties if tez.tar.gz is in the /apps/tez/lib directory on HDFS:
<property>
<name>tez.lib.uris</name>
<value>hdfs://<Active_Name_Node>:8020/hdp/apps/<version>/tez/tez.tar.gz</value>
<description>The location of tez.tar.gz. Set tez.lib.uris to point to the tar.gz
uploaded to HDFS.</description>
</property>
<property>
<name>tez.am.launch.env</name>
<value>LD_LIBRARY_PATH=/usr/hdp/<hadoop_version>/hadoop/lib/native</value>
<description>The location of Hadoop libraries.</description>
</property>

Configure Tez for HiveServer2


If you use HiveServer2 to run mappings, open the tez-site.xml file. Verify that the following properties are
commented out:

• tez.am.launch.cmd-opts
• tez.task.launch.env
• tez.am.launch.env



Enable HBase Support
To use HBase as a source or target when you run a mapping in the Hadoop environment, you must add
hbase-site.xml to a distributed cache.

Perform the following steps:

1. On the machine where the Data Integration Service runs, go to the following directory: <Informatica
installation directory>/services/shared/hadoop/hortonworks_<version>/infaConf.
2. Edit hadoopEnv.properties.
3. Verify that infapdo.env.entry.mapred_classpath specifies the correct HBase version for the Hadoop
cluster.
The following sample text shows infapdo.env.entry.mapred_classpath for a Hadoop cluster that uses
HBase version 1.1.1.2.3.0.0-2504:
infapdo.env.entry.mapred_classpath=INFA_MAPRED_CLASSPATH=
$HADOOP_NODE_HADOOP_DIST/lib/hbase-server-1.1.1.2.3.0.0-2504.jar:
$HADOOP_NODE_HADOOP_DIST/lib/htrace-core.jar:$HADOOP_NODE_HADOOP_DIST/lib/htrace-
core-2.04.jar:$HADOOP_NODE_HADOOP_DIST/lib/protobuf-java-2.5.0.jar:
$HADOOP_NODE_HADOOP_DIST/lib/hbase-client-1.1.1.2.3.0.0-2504.jar:
$HADOOP_NODE_HADOOP_DIST/lib/hbase-common-1.1.1.2.3.0.0-2504.jar:
$HADOOP_NODE_HADOOP_DIST/lib/hive-hbase-handler-1.2.1.2.3.0.0-2504.jar:
$HADOOP_NODE_HADOOP_DIST/lib/hbase-protocol-1.1.1.2.3.0.0-2504.jar
4. Add the following entry to the infapdo.aux.jars.path variable: file://$DIS_HADOOP_DIST/conf/
hbase-site.xml.
The following sample text shows infapdo.aux.jars.path with the variable added:
infapdo.aux.jars.path=file://$DIS_HADOOP_DIST/infaLib/hive0.14.0-infa-
boot.jar,file://$DIS_HADOOP_DIST/infaLib/hive-infa-plugins-interface.jar,file://
$DIS_HADOOP_DIST/infaLib/profiling-hive0.14.0-udf.jar,file://$DIS_HADOOP_DIST/
infaLib/hadoop2.2.0-avro_complex_file.jar,file://$DIS_HADOOP_DIST/conf/hbase-site.xml
5. On the machine where the Data Integration Service runs, go to the following directory: <Informatica
installation directory>/services/shared/hadoop/hortonworks_<version>/conf.
6. In hbase-site.xml and hive-site.xml, verify that the zookeeper.znode.parent property exists and
matches the property set in hbase-site.xml on the cluster.
By default, the ZooKeeper directory on the cluster is /usr/hdp/current/hbase-client/conf.
7. On the machine where the Developer tool runs, go to the following directory: <Informatica installation
directory>\clients\DeveloperClient\hadoop\hortonworks_<version>/conf.
8. In hbase-site.xml and hive-site.xml, verify that the zookeeper.znode.parent property exists and
matches the property set in hbase-site.xml on the cluster.
By default, the ZooKeeper directory on the cluster is /usr/hdp/current/hbase-client/conf.
9. Edit the Hadoop classpath on every node on the Hadoop cluster to point to the hbase-protocol.jar file,
as shown in the example after this step. Then, restart the Node Manager for each node in the Hadoop cluster.
hbase-protocol.jar is located in the HBase installation directory on the Hadoop cluster. For more
information, refer to the following link: https://issues.apache.org/jira/browse/HBASE-10304
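The following sketch shows one way to make this change, assuming that you edit hadoop-env.sh on each node and
that hbase-protocol.jar is in the default HDP client location. Adjust the path to match the HBase installation
directory on your cluster:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar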



Create the HiveServer2 Environment Variables to Configure the HiveServer2
Environment
Before you can configure the HiveServer2 environment, create the required environment variables. Then
configure the HiveServer2 environment with Ambari or the hive-env.sh file.

You can run the Big Data Management Configuration Utility and select HiveServer2 to generate the
HiveServer2_EnvInfa.txt file. Alternatively, you can modify a template to create the required environment
variables.

Modify the following template:


export LD_LIBRARY_PATH=/opt/Informatica/services/shared/bin:/opt/Informatica/services/
shared/hadoop/hortonworks_2.3/lib/native:$LD_LIBRARY_PATH
export INFA_HADOOP_DIST_DIR=/opt/Informatica/services/shared/hadoop/hortonworks_2.3
export INFA_PLUGINS_HOME=/opt/Informatica/plugins

export TMP_INFA_AUX_JARS=$INFA_HADOOP_DIST_DIR/infaLib/hadoop2.4.0-hdfs-native-impl.jar:
$INFA_HADOOP_DIST_DIR/infaLib/hadoop2.7.1.hw23-native-impl.jar:$INFA_HADOOP_DIST_DIR/
infaLib/hbase1.1.2-infa-plugins.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-
boot.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-plugins.jar:
$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-storagehandler.jar:$INFA_HADOOP_DIST_DIR/
infaLib/hive0.14.0-native-impl.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive1.1.0-
avro_complex_file.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive-infa-plugins-interface.jar:
$INFA_HADOOP_DIST_DIR/infaLib/infa-hadoop-hdfs.jar:$INFA_HADOOP_DIST_DIR/infaLib/
profiling-hive0.14.0-udf.jar:/opt/Informatica/infa_jars.jar:$INFA_HADOOP_DIST_DIR/lib/
parquet-avro-1.6.0rc3.jar

if [ "${HIVE_AUX_JARS_PATH}" != "" ]; then


export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH:$TMP_INFA_AUX_JARS
else
export HIVE_AUX_JARS_PATH=$TMP_INFA_AUX_JARS
fi

export JAVA_LIBRARY_PATH=/opt/Informatica/services/shared/bin
export INFA_RESOURCES=/opt/Informatica/services/shared/bin
export INFA_HOME=/opt/Informatica
export IMF_CPP_RESOURCE_PATH=/opt/Informatica/services/shared/bin
export
INFA_MAPRED_OSGI_CONFIG='osgi.framework.activeThreadType:false&:org.osgi.framework.stora
ge.clean:none&:eclipse.jobs.daemon:true&:infa.osgi.enable.workdir.reuse:true&:infa.osgi.
parent.workdir::/tmp/infa&:infa.osgi.workdir.poolsize:4'

In the template text above, replace the following text:

• Replace <HADOOP_NODE_INFA_HOME> with the Informatica installation directory on the HDInsight 3.3
cluster.
• Replace <HADOOP_DISTRIBUTION> with the Informatica Hadoop installation directory on the HDInsight
3.3 cluster.

Note: If you use Ambari with CSH as the default shell, you must change the export command to set.

After you create the environment variables, configure the HiveServer2 environment with Ambari or the hive-
env.sh file.

Configure the HiveServer2 Environment with Ambari


After you create the HiveServer2 environment variables, configure the HiveServer2 environment.

1. Open the modified template.


2. Copy the contents of the file.
Note: If you use Ambari with CSH as the default shell, you must change the export command to set.
3. Log in to Ambari.
4. Click Hive > Configs > Advanced.
5. Search for the "hive-env template" property.
6. Paste the contents of the modified template.
7. Save the changes.
8. Restart the HiveServer2 services.

Configure the HiveServer2 Environment with hive-env


After you create the HiveServer2 environment variables with the modified template, configure the
HiveServer2 environment with the hive-env.sh file.

1. Open the hive-env.sh file.


You can find hive-env.sh in the following directory: /etc/hive/conf/hive-env.sh.
2. Copy and paste the contents of the modified template at the end of hive-env.sh.
3. Restart HiveServer2 services.

Configure the Hadoop Cluster for the Blaze Engine


To use the Blaze engine, you must configure the Hadoop cluster.

Complete the following tasks:

• Configure the Hadoop cluster for the Application Timeline Server.


• Start the Application Timeline Server.
• Enable the Blaze Engine console.

Configure the Hadoop Cluster for the Application Timeline Server


You must configure properties in the yarn-site.xml file on the Hadoop cluster for the Application Timeline
Server.

You can use Ambari to configure the required properties in the yarn-site.xml file. Alternatively, configure
the yarn-site.xml file on every node in the Hadoop cluster.

You can find the yarn-site.xml file in the following directory on every node in the Hadoop cluster: /etc/
hadoop/conf.

Configure the following properties:


yarn.timeline-service.webapp.address
The HTTP address for the Application Timeline service web application.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.enabled
Indicates whether the Timeline service is enabled.

Set this value to true.

yarn.timeline-service.address
Address for the Application Timeline Server to start the RPC server.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.hostname
The host name for the Application Timeline Service web application.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.ttl-ms
The time-to-live in milliseconds for data in the timeline store.

Use 3600000.

yarn.nodemanager.resource.memory-mb
Amount of physical memory that can be allotted for containers.

The Blaze engine requires at least 6144 MB.

yarn.nodemanager.local-dirs
List of directories to store localized files in.

The Blaze engine uses local directories for a distributed cache.

The following sample text shows the properties you configure in the yarn-site.xml file:
<property>
<name>yarn.timeline-service.webapp.address</name>
<value><ATSHostname>:8188</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.address</name>
<value><ATSHostname>:10200</value>
</property>
<property>
<name>yarn.timeline-service.hostname</name>
<value><ATSHostname></value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>3600000</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>6144</value>
<description>Amount of physical memory that can be allotted for containers.</
description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/<local_directory>,/<local_directory></value>
</property>

Start the Application Timeline Server


The Blaze engine uses the Hadoop Application Timeline Server to store the job monitor status.

To start the Hadoop Application Timeline Server, run the following command on any node in the Hadoop
cluster:
sudo yarn timelineserver &



Enable the Blaze Engine Console
Enable the Blaze engine console in the hadoopEnv.properties file.

1. On the machine where the Data Integration Service runs, edit the hadoopEnv.properties file.
You can find hadoopEnv.properties in the following directory: <Informatica installation
directory>/services/shared/hadoop/<hadoop_distribution><version_number>/infaConf.
2. Set the infagrid.blaze.console.enabled property to true.
3. Save and close the hadoopEnv.properties file.

Disable SQL Standard Based Authorization to Run Mappings with HiveServer2
If the Hadoop cluster uses SQL standard based authorization, you must disable it to run mappings with
HiveServer2.

1. Log in to Ambari.
2. Select Hive > Configs.
3. In the Security section, set Hive Security Authorization to None.
4. Navigate to the Advanced tab for hiveserver2-site.
5. Set Enable Authorization to false.
6. Restart Hive Services.

Enable Storage Based Authorization with HiveServer2


Optionally, you can use storage-based authorization with HiveServer2.

1. Log in to Ambari.
2. Click Hive > Configs.
3. In the Security section, set the Hive Security Authorization to SQLStdAuth.
4. Navigate to Advanced Configs.
5. In the General section, configure the following properties:
Hive Authorization Manager
Set this property to the following value:
org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvide
r,org.apache.hadoop.hive.ql.security.authorization.MetaStoreAuthzAPIAuthorizerEmb
edOnly
hive.security.authorization.manager
Set this property to the following value:
org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvide
r
By default, this property is set to the following value:
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuth
orizerFactory
Enable Authorization
Set this property to True.
6. In the Advanced hiveserver2-site section, configure the following properties:

Enable Authorization
Set this value to True.

hive.security.authorization.manager
Set this property to the following value:
org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvide
r
By default, this property is set to the following value:
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthoriz
erFactory
7. Restart all Hive services.

Enable Support for HBase with HiveServer2


You must configure Big Data Management to run a mapping that uses an HBase source or target with
HiveServer2.

Perform the following steps:

1. Verify that the value for the zookeeper.znode.parent property in the hbase-site.xml file on the
machine where the Data Integration Service runs matches the value on the Hadoop cluster.
The default value is /hbase-unsecure.
You can find the hbase-site.xml file in the following directory on the machine where the Data
Integration Service runs: <Informatica installation directory>/services/shared/hadoop/
hortonworks_<version>/conf.
You can find the hbase-site.xml file in the following directory on the Hadoop cluster: <Informatica
installation directory>/services/shared/hadoop/hortonworks_<version>.
2. Verify that the infapdo.aux.jars.path property contains the path to the hbase-site.xml file.
The following sample text shows the infapdo.aux.jars.path property with the path for hbase-site.xml:
infapdo.aux.jars.path=file://$HADOOP_NODE_HADOOP_DIST/infaLib/hive0.14.0-infa-
boot.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/hive-infa-plugins-
interface.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/profiling-hive0.14.0-
udf.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/hadoop2.2.0-
avro_complex_file.jar,file://$HADOOP_NODE_HADOOP_DIST/conf/hbase-site.xml,file://
$HADOOP_NODE_HADOOP_DIST/infaLib/infa_jars.jar

Configure and Start Informatica Services


After Azure finishes deploying the Informatica domain, use the Administrator tool to configure and start
Informatica services.

The Administrator tool is a browser-enabled utility that allows you to create, configure and run different
services on the Informatica domain.

To see more about Informatica services, see the Informatica Application Service Guide. You can download
this and all other documentation from the Informatica Network portal.

To access the Administrator tool, perform the following steps:

1. Get the host name, IP address, and port of the virtual machine where Azure deployed the Informatica
domain.
2. Add an entry for this domain host to the hosts file.

In the entry, type the IP address followed by the host name. For example:
1.2.3.4 Informatica_host_name
3. Optionally, use a secure copy program to copy the license file to the domain host machine.
4. In a new browser tab, enter the following URL to start the Administrator tool:
https://<Informatica_host_name>:<port_number>
The default port number for the Administrator tool is 6008.
5. In the Administrator tool, upload the Big Data Management license file.
The Big Data Management license file is required to use Big Data Management on HDInsight. Perform
the following steps:
a. Click Manage > Services and Nodes.
The Domain Navigator opens.
b. Select the Domain and click New > License.
The Create License window opens.
c. Supply a name for the license, then browse to select the license file.

6. Use the Administrator tool to create, configure and run application services on the Informatica domain.

Configure Settings on the Informatica Domain


After you install Big Data Management, edit the hdfs-site.xml file to support access to files in the HDInsight
wasb location.

• To change the value of the fs.defaultFS property to the wasb location, edit the following file:
<Informatica_installation_directory>/services/shared/<hadoop_distribution>/conf/hive-site.xml
You can get the wasb location from the hdfs-site.xml file on the Hadoop cluster, or through the Ambari
cluster management tool. The example after this step shows the property format.
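The following sketch shows the property format, using the wasb URI format described in the connection properties
later in this chapter. Replace the container and account names with the values for your cluster:
<property>
<name>fs.defaultFS</name>
<value>wasb://<container_name>@<account_name>.blob.core.windows.net/</value>
</property>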

Connections
Define the connections that you want to use to access data in HBase, HDFS, Hive, or relational databases, or
run a mapping on a Hadoop cluster. You can create the connections using the Developer tool, Administrator
tool, and infacmd.

You can create the following types of connections:


Hadoop connection

Create a Hadoop connection to run mappings on the Hadoop cluster. Select the Hadoop connection if
you select the Hadoop run-time environment. You must also select the Hadoop connection to validate a
mapping to run on the Hadoop cluster. Before you run mappings in the Hadoop cluster, review the
information in this guide about rules and guidelines for mappings that you can run in the Hadoop cluster.

HDFS connection
Create an HDFS connection to read data from or write data to the HDFS file system on the Hadoop
cluster.

HBase connection
Create an HBase connection to access HBase. The HBase connection is a NoSQL connection.

Hive connection
Create a Hive connection to access Hive as a source or target. You can access Hive as a source if the
mapping is enabled for the native or Hadoop environment. You can access Hive as a target only if the
mapping uses the Hive engine.

JDBC connection
Create a JDBC connection and configure Sqoop properties in the connection to import and export
relational data through Sqoop. You must also create a Hadoop connection to run the mapping on the
Hadoop cluster.

Note: For information about creating connections to other sources or targets such as social media web sites
or Teradata, see the respective PowerExchange adapter user guide for information.

HDFS Connection Properties


Use a Hadoop File System (HDFS) connection to access data in the Hadoop cluster. The HDFS connection is
a file system type connection. You can create and manage an HDFS connection in the Administrator tool,
Analyst tool, or the Developer tool. HDFS connection properties are case sensitive unless otherwise noted.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes HDFS connection properties:

Property Description

Name Name of the connection. The name is not case sensitive and must be unique within the domain. The name
cannot exceed 128 characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must
be 255 characters or less and must be unique in the domain. You cannot change this property after you
create the connection. Default value is the connection name.

Description The description of the connection. The description cannot exceed 765 characters.

Location The domain where you want to create the connection. Not valid for the Analyst tool.

Type The connection type. Default is Hadoop File System.

User Name User name to access HDFS.

NameNode URI The URI to access the HDInsight file system.


Use the following URI:
hdfs://



HBase Connection Properties
Use an HBase connection to access HBase. The HBase connection is a NoSQL connection. You can create
and manage an HBase connection in the Administrator tool or the Developer tool. HBase connection
properties are case sensitive unless otherwise noted.

The following table describes HBase connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within
the domain. You can change this property after you create the connection. The name
cannot exceed 128 characters, contain spaces, or contain the following special
characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not
case sensitive. It must be 255 characters or less and must be unique in the domain. You
cannot change this property after you create the connection. Default value is the
connection name.

Description The description of the connection. The description cannot exceed 4,000 characters.

Location The domain where you want to create the connection.

Type The connection type. Select HBase.

ZooKeeper Host(s) Name of the machine that hosts the ZooKeeper server.

ZooKeeper Port Port number of the machine that hosts the ZooKeeper server.
Use the value specified for hbase.zookeeper.property.clientPort in hbase-
site.xml. You can find hbase-site.xml on the Namenode machine in the following
directory: /opt/HDinsight/hbase/hbase-0.98.7/conf

Enable Kerberos Connection Enables the Informatica domain to communicate with the HBase master server or region
server that uses Kerberos authentication.

HBase Master Principal Service Principal Name (SPN) of the HBase master server. Enables the ZooKeeper
server to communicate with an HBase master server that uses Kerberos authentication.
Enter a string in the following format:
hbase/<domain.name>@<YOUR-REALM>
Where:
- domain.name is the domain name of the machine that hosts the HBase master server.
- YOUR-REALM is the Kerberos realm.

HBase Region Server Principal Service Principal Name (SPN) of the HBase region server. Enables the ZooKeeper
server to communicate with an HBase region server that uses Kerberos authentication.
Enter a string in the following format:
hbase_rs/<domain.name>@<YOUR-REALM>
Where:
- domain.name is the domain name of the machine that hosts the HBase master server.
- YOUR-REALM is the Kerberos realm.



Hive Connection Properties
Use the Hive connection to access Hive data. A Hive connection is a database type connection. You can
create and manage a Hive connection in the Administrator tool, Analyst tool, or the Developer tool. Hive
connection properties are case sensitive unless otherwise noted.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes Hive connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within
the domain. You can change this property after you create the connection. The name
cannot exceed 128 characters, contain spaces, or contain the following special
characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not
case sensitive. It must be 255 characters or less and must be unique in the domain. You
cannot change this property after you create the connection. Default value is the
connection name.

Description The description of the connection. The description cannot exceed 4000 characters.

Location The domain where you want to create the connection. Not valid for the Analyst tool.

Type The connection type. Select Hive.

Connection Modes Hive connection mode. Select at least one of the following options:
- Access Hive as a source or target. Select this option if you want to use the connection
to access the Hive data warehouse. If you want to use Hive as a target, you must
enable the same connection or another Hive connection to run mappings in the
Hadoop cluster.
- Use Hive to run mappings in Hadoop cluster. Select this option if you want to use the
connection to run mappings in the Hadoop cluster.
You can select both the options. Default is Access Hive as a source or target.

User Name User name of the user that the Data Integration Service impersonates to run mappings
on a Hadoop cluster.
Use the user name of an operating system user that is present on all nodes on the
Hadoop cluster.

Common Attributes to Both the Modes: Environment SQL
SQL commands to set the Hadoop environment. In the native environment type, the Data Integration Service
executes the environment SQL each time it creates a connection to a Hive metastore. If you use the Hive
connection to run mappings in the Hadoop cluster, the Data Integration Service executes the environment SQL at
the beginning of each Hive session.
The following rules and guidelines apply to the usage of environment SQL in both connection modes:
- Use the environment SQL to specify Hive queries.
- Use the environment SQL to set the classpath for Hive user-defined functions and then use environment SQL or
PreSQL to specify the Hive user-defined functions. You cannot use PreSQL in the data object properties to
specify the classpath. The path must be the fully qualified path to the JAR files used for user-defined
functions. Set the parameter hive.aux.jars.path with all the entries in infapdo.aux.jars.path and the path to
the JAR files for user-defined functions.
- You can use environment SQL to define Hadoop or Hive parameters that you want to use in the PreSQL commands
or in custom queries.
If you use the Hive connection to run mappings in the Hadoop cluster, the Data Integration Service executes
only the environment SQL of the Hive connection. If the Hive sources and targets are on different clusters, the
Data Integration Service does not execute the different environment SQL commands for the connections of the
Hive source or target.



Properties to Access Hive as Source or Target
The following table describes the connection properties that you configure to access Hive as a source or
target:

Property Description

Metadata Connection String The JDBC connection URI used to access the metadata from the Hadoop server.
You can use PowerExchange for Hive to communicate with a HiveServer service or HiveServer2 service.
To connect to HiveServer2, specify the connection string in the following format:
jdbc:hive2://<hostname>:<port>/<db>;transportMode=<mode>
Where
- <hostname> is the name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database to which you want to connect. If you do not provide the database name, the Data
Integration Service uses the default database details.
- <mode> is the value of the hive.server2.transport.mode property in the Hive tab of the Ambari tool.

Bypass Hive JDBC Server JDBC driver mode. Select the check box to use the embedded JDBC driver mode.
To use the JDBC embedded mode, perform the following tasks:
- Verify that the Hive client and Informatica services are installed on the same machine.
- Configure the Hive connection properties to run mappings in the Hadoop cluster.
If you choose the non-embedded mode, you must configure the Data Access Connection String.
Informatica recommends that you use the JDBC embedded mode.

Data Access Connection String The JDBC connection URI used to access data from the Hadoop server.
You can use PowerExchange for Hive to communicate with a HiveServer service or HiveServer2 service.
To connect to HiveServer2, specify the connection string in the following format:
jdbc:hive2://<hostname>:<port>/<db>;transportMode=<mode>
Where
- <hostname> is the name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database to which you want to connect. If you do not provide the database name, the Data
Integration Service uses the default database details.
- <mode> is the value of the hive.server2.transport.mode property in the Hive tab of the Ambari tool.



Properties to Run Mappings in Hadoop Cluster
The following table describes the Hive connection properties that you configure when you want to use the
Hive connection to run Informatica mappings in the Hadoop cluster:

Property Description

Database Name Namespace for tables. Use the name default for tables that do not have a specified
database name.

Default FS URI The URI to access the default HDInsight file system.
Use the connection URI that matches the storage type. The storage type is configured for
the cluster in the fs.defaultFS property.
If the cluster uses HDFS storage, use the following string to specify the URI:
hdfs://<cluster_name>
Example:
hdfs://my-cluster
If the cluster uses wasb storage, use the following string to specify the URI:
wasb://<container_name>@<account_name>.blob.core.windows.net/
<path>
where:
- <container_name> identifies a specific Azure Blob storage container.
Note: <container_name> is optional.
- <account_name> identifies the Azure storage object.
Example:
wasb://infabdmoffering1storage.blob.core.windows.net/
infabdmoffering1cluster/mr-history

Yarn Resource Manager URI The service within Hadoop that submits the MapReduce tasks to specific nodes in the
cluster.
For HDInsight 3.3 with YARN, use the following format:
<hostname>:<port>
Where
- <hostname> is the host name or IP address of the JobTracker or Yarn resource
manager.
- <port> is the port on which the JobTracker or Yarn resource manager listens for
remote procedure calls (RPC).
Use the value specified by yarn.resourcemanager.address in yarn-site.xml.
You can find yarn-site.xml in the following directory on the NameNode: /etc/hadoop/<version>/0/.
For HDInsight 3.3 with MapReduce 2, use the following URI:
hdfs://host:port

Hive Warehouse Directory on HDFS The absolute HDFS file path of the default database for the warehouse that is local to the
cluster. For example, the following file path specifies a local warehouse:
/user/hive/warehouse
If the Metastore Execution Mode is remote, then the file path must match the file path
specified by the Hive Metastore Service on the Hadoop cluster.
Use the value specified for the hive.metastore.warehouse.dir property in
hive-site.xml. You can find hive-site.xml in the following directory on the node
that runs HiveServer2: /etc/hive/<version>/0/.

Advanced Hive/Hadoop Properties Configures or overrides Hive or Hadoop cluster properties in hive-site.xml on the
machine on which the Data Integration Service runs. You can specify multiple properties.
Use the following format:
<property1>=<value>
Where
- <property1> is a Hive or Hadoop property in hive-site.xml.
- <value> is the value of the Hive or Hadoop property.
To specify multiple properties use &: as the property separator.
The maximum length for the format is 1 MB.
If you enter a required property for a Hive connection, it overrides the property that you
configure in the Advanced Hive/Hadoop Properties.
The Data Integration Service adds or sets these properties for each map-reduce job. You
can verify these properties in the JobConf of each mapper and reducer job. Access the
JobConf of each job from the Jobtracker URL under each map-reduce job.
The Data Integration Service writes messages for these properties to the Data Integration
Service logs. The Data Integration Service must have the log tracing level set to log each
row or to verbose initialization tracing.
For example, specify the following properties to control and limit the number of reducers
to run a mapping job:
mapred.reduce.tasks=2&:hive.exec.reducers.max=10

Temporary Table Compression Codec Hadoop compression library for a compression codec class name.

Codec Class Name Codec class name that enables data compression and improves performance on
temporary staging tables.

Metastore Execution Mode Controls whether to connect to a remote metastore or a local metastore. By default, local
is selected. For a local metastore, you must specify the Metastore Database URI, Driver,
Username, and Password. For a remote metastore, you must specify only the Remote
Metastore URI.

Metastore Database URI The JDBC connection URI used to access the data store in a local metastore setup. Use
the following connection URI:
jdbc:<data store type>://<node name>:<port>/<database name>
where
- <data store type> is the type of the data store.
- <node name> is the host name or IP address of the data store.
- <port> is the port on which the data store listens for remote procedure calls (RPC).
- <database name> is the name of the database.
For example, the following URI specifies a local metastore that uses MySQL as a data
store:
jdbc:mysql://hostname23:3306/metastore
Use the value specified for the javax.jdo.option.ConnectionURL property in
hive-site.xml. You can find hive-site.xml in the following directory on the node
that runs HiveServer2: /etc/hive/<version>/0/hive-site.xml.

Metastore Database Driver Driver class name for the JDBC data store. For example, the following class name
specifies a MySQL driver:
com.mysql.jdbc.Driver
Use the value specified for the javax.jdo.option.ConnectionDriverName
property in hive-site.xml. You can find hive-site.xml in the following directory
on the node that runs HiveServer2: /etc/hive/<version>/0/hive-site.xml.

Metastore Database Username The metastore database user name.


Use the value specified for the javax.jdo.option.ConnectionUserName
property in hive-site.xml. You can find hive-site.xml in the following directory
on the node that runs HiveServer2: /etc/hive/<version>/0/hive-site.xml.

Metastore Database Password Required if the Metastore Execution Mode is set to local. The password for the metastore
user name.
Use the value specified for the javax.jdo.option.ConnectionPassword
property in hive-site.xml. You can find hive-site.xml in the following directory
on the node that runs HiveServer2: /etc/hive/<version>/0/hive-site.xml.

Remote Metastore URI The metastore URI used to access metadata in a remote metastore setup. For a remote
metastore, you must specify the Thrift server details.
Use the following connection URI:
thrift://<hostname>:<port>
Where
- <hostname> is the name or IP address of the Thrift metastore server.
- <port> is the port on which the Thrift server is listening.
Use the value specified for the hive.metastore.uris property in hive-site.xml.
You can find hive-site.xml in the following directory on the node that runs
HiveServer2: /etc/hive/<version>/0/hive-site.xml.

Hive Connection String The JDBC connection URI used to access the metadata from the Hadoop server.
You can use PowerExchange for Hive to communicate with a HiveServer service or
HiveServer2 service.
To connect to HiveServer2, specify the connection string in the following format:
jdbc:hive2://<hostname>:<port>/<db>;transportMode=<mode>
Where
- <hostname> is the name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database to which you want to connect. If you do not provide the database
name, the Data Integration Service uses the default database details.
- <mode> is the value of the hive.server2.transport.mode property in the Hive tab of the
Ambari tool.
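For example, the following values show a hypothetical set of properties for a cluster that uses HDFS storage and a remote metastore. The host names, ports, and directories are placeholders; use the values from your cluster configuration files:
Database Name: default
Default FS URI: hdfs://my-cluster
Yarn Resource Manager URI: my-cluster-rm.example.com:8032
Hive Warehouse Directory on HDFS: /user/hive/warehouse
Metastore Execution Mode: remote
Remote Metastore URI: thrift://my-cluster-hive.example.com:9083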

Configuring Big Data Management in the Cloudera Environment
You can enable Informatica mappings to run on a Hadoop cluster on Cloudera CDH.

Informatica supports Cloudera CDH clusters that are deployed on-premise, on Amazon EC2, or on Microsoft
Azure.

To enable Informatica mappings to run on a Cloudera CDH cluster, complete the following steps:

Note: If you do not use HiveServer2 to run mappings, skip the HiveServer2 related steps.

1. Configure Hadoop cluster properties on the machine on which the Data Integration Service runs.
2. Create a staging directory on HDFS.

3. Configure virtual memory limits.
4. Add hbase-protocol.jar to the Hadoop classpath.
5. Configure the Blaze engine.
6. Optionally, configure the HiveServer2 environment.
7. Optionally, configure DB2 partitioning with HiveServer2.
8. Optionally, disable SQL-based authorization for HiveServer2.

Configure Hadoop Cluster Properties on the Data Integration Service Machine
Configure Hadoop cluster properties in the hive-site.xml and yarn-site.xml files that the Data Integration
Service uses when it runs mappings on a Cloudera CDH cluster.

Configure hive-site.xml for the Data Integration Service


hive-site.xml is located in the following directory on the machine where the Data Integration Service runs:

<Informatica installation directory>/services/shared/hadoop/cloudera_cdh<version>/conf

In hive-site.xml, configure the following property:


hive.optimize.constant.propagation
Whether to enable the constant propagation optimizer.

Set this value to false.

The following sample code describes the properties you can set in hive-site.xml:
<property>
<name>hive.optimize.constant.propagation</name>
<value>false</value>
</property>

Configure yarn-site.xml for the Data Integration Service


The yarn-site.xml file is located in the following directory on the machine where the Data Integration
Service runs:

<Informatica installation directory>/services/shared/hadoop/cloudera_cdh<version>/conf

Configure the following Hadoop cluster properties:


yarn.resourcemanager.webapp.address
Web application address for the Resource Manager.

Use the value in the following file: /etc/hadoop/conf/yarn-site.xml.

yarn.application.classpath
Required if you used the Big Data Management Configuration Utility. A comma-separated list of
CLASSPATH entries for YARN applications.

Use the following values:


$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,
$HADOOP_HDFS_HOME/lib/*,$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*

Alternatively, you can use the value for this property from yarn-site.xml on the Hadoop cluster.

The Big Data Management Configuration utility automatically configures the following properties in the yarn-
site.xml file. You can also manually configure the properties.

mapreduce.jobhistory.address
Location of the MapReduce JobHistory Server.

Use the value in the following file:/etc/hadoop/conf/mapred-site.xml

mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server.

Use the value in the following file: /etc/hadoop/conf/mapred-site.xml

yarn.resourcemanager.scheduler.address
Scheduler interface address.

Use the value in the following file: /etc/hadoop/conf/yarn-site.xml

You can set the following properties in yarn-site.xml:


<property>
<name>mapreduce.jobhistory.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server IPC host:port</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server Web UI host:port</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hostname:port</value>
<description>The address of the scheduler interface</description>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hostname:port</value>
<description>The address for the Resource Manager web application.</description>
</property>

Create a Staging Directory on HDFS


If the Cloudera cluster uses HiveServer2, you must grant the anonymous user the Execute permission on the
staging directory or you must create another staging directory on HDFS.

By default, a staging directory already exists on HDFS. You must grant the anonymous user the Execute
permission on the staging directory. If you cannot grant the anonymous user the Execute permission on this
directory, you must enter a valid user name for the user in the Hive connection. If you use the default staging
directory on HDFS, you do not have to configure mapred-site.xml or hive-site.xml.

If you want to create another staging directory to store mapreduce jobs, you must create a directory on
HDFS. After you create the staging directory, you must add it to mapred-site.xml and hive-site.xml.

To create another staging directory on HDFS, run the following commands from the command line of the
machine that runs the Hadoop cluster:
hadoop fs -mkdir /staging
hadoop fs -chmod -R 0777 /staging

Add the staging directory to mapred-site.xml.

mapred-site.xml is located in the following directory on the Hadoop cluster: /etc/hadoop/conf/mapred-


site.xml

For example, add the following entry to mapred-site.xml:
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/staging</value>
</property>

Add the staging directory to hive-site.xml on the machine where the Data Integration Service runs.

hive-site.xml is located in the following directory on the machine where the Data Integration Service runs:
<Informatica installation directory>/services/shared/hadoop/cloudera_cdh<version>/conf.

In hive-site.xml, add the yarn.app.mapreduce.am.staging-dir property. Use the value that you specified
in mapred-site.xml.

For example, add the following entry to hive-site.xml:


<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/staging</value>
</property>

Configure Virtual Memory Limits


Configure the virtual memory limits in yarn-site.xml for every node in the Hadoop cluster. After you configure
virtual memory limits you must restart the Hadoop cluster.

yarn-site.xml is located in the following directory on every node in the Hadoop cluster:

/etc/hadoop/conf/yarn-site.xml

In yarn-site.xml, configure the following property:


yarn.nodemanager.vmem-check-enabled
Determines whether virtual memory limits are enforced for containers. Set this value to false.

The following example describes the property you can configure in yarn-site.xml:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Enforces virtual memory limits for containers.</description>
</property>

Add hbase-protocol.jar to the Hadoop classpath


Add a reference to the hbase-protocol.jar file to the Hadoop classpath on every node on the Hadoop
cluster.

Edit the Hadoop classpath on every node on the Hadoop cluster to point to the hbase-protocol.jar file.
Then, restart the Node Manager for each node in the Hadoop cluster.

hbase-protocol.jar is located in the HBase installation directory on the Hadoop cluster. For more
information, refer to the following link: https://issues.apache.org/jira/browse/HBASE-10304
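For example, the following line is a minimal sketch of the change when you add the JAR to HADOOP_CLASSPATH in hadoop-env.sh on each node. The JAR location is an assumption and depends on where HBase is installed on your cluster:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/hbase/lib/hbase-protocol.jar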

Configure the HiveServer2 Environment


To use HiveServer2 to run mappings, you must configure the HiveServer2 environment with the environment
file that the Big Data Management Configuration Utility generates.

You can use Cloudera Manager to configure the HiveServer2 environment. Alternatively, you can copy the
contents of HiveServer2_EnvInfa.txt to the end of the hive-env.sh file.

When you run the Big Data Management Configuration Utility and select HiveServer2, the utility creates the
HiveServer2_EnvInfa.txt file in the following directory on the machine where the Data Integration Service
runs: <Informatica installation directory>/tools/BDMUtil.

You can find hive-env.sh in the following directory on the Hadoop cluster: /etc/hive/conf/hive-env.sh.

1. Open the HiveServer2_EnvInfa.txt file.


2. Find and copy the HIVE_AUX_JARS_PATH property.
3. Log in to Cloudera Manager.
4. Click on Hive > Configuration.
5. Search for the following property in Filters: Hive Auxiliary JARs Directory.
6. Paste the value of HIVE_AUX_JARS_PATH from HiveServer2_EnvInfa.txt to the Hive Auxiliary JARs
Directory property.
7. Search for the following property in Filters: HiveServer2 Environment Advanced Configuration Snippet
(Safety Valve).
8. Copy the contents of HiveServer2_EnvInfa.txt except for the HIVE_AUX_JARS_PATH property.
9. In the HiveServer2 Default Group, paste the contents of HiveServer2_EnvInfa.txt.
10. Click Save Changes.
11. Restart Hive services.
Click Restart > Review Changes > Restart cluster.
12. Select Rolling Restart > Hive > Restart Now.
13. Click Finish.

Configure the Hadoop Cluster for the Blaze Engine


To use the Blaze engine, you must configure the Hadoop cluster.

To configure a Cloudera CDH cluster for the Blaze engine, complete the following tasks:

• Configure Blaze on Kerberos-enabled clusters


• Configure yarn-site.xml on every node in the Hadoop cluster.
• Prepare and start the Application Timeline Server.
• Enable the Blaze Engine console.

Configure Blaze on Kerberos-Enabled Clusters


To use the Blaze runtime engine on Kerberos-enabled clusters, perform the following steps:

1. Copy the following files from the cluster name node:


• core-site.xml
• hdfs-site.xml
2. Paste the files to the following directory on the VM that runs the Data Integration Service:
<InformaticaInstallationDir>/services/shared/hadoop/<HadoopDistributionName>/conf
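For example, the following commands are a minimal sketch that copies the files with scp. The user name and name node host are placeholders, and the source path assumes the default /etc/hadoop/conf location on the cluster:
scp <user>@<namenode_host>:/etc/hadoop/conf/core-site.xml <InformaticaInstallationDir>/services/shared/hadoop/<HadoopDistributionName>/conf/
scp <user>@<namenode_host>:/etc/hadoop/conf/hdfs-site.xml <InformaticaInstallationDir>/services/shared/hadoop/<HadoopDistributionName>/conf/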

Configure yarn-site.xml for the Application Timeline Server
You must configure properties in the yarn-site.xml file on every node in the Hadoop cluster to enable the
Application Timeline Server.

Configure the following properties:


yarn.timeline-service.webapp.address
The HTTP address for the Application Timeline service web application.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.enabled
Indicates whether the Timeline service is enabled.

Set this value to true.

yarn.timeline-service.address
Address for the Application Timeline Server to start the RPC server.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.hostname
The host name for the Application Timeline Service web application.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.ttl-ms
The time-to-live in milliseconds for data in the timeline store.

Use 3600000.

yarn.nodemanager.resource.memory-mb
Amount of physical memory that can be allotted for containers.

The Blaze engine requires at least 6144 MB.

yarn.nodemanager.local-dirs
List of directories to store localized files in.

The Blaze engine uses local directories for a distributed cache.

The following sample text shows the properties you configure in the yarn-site.xml file:
<property>
<name>yarn.timeline-service.webapp.address</name>
<value><ATSHostname>:8188</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.address</name>
<value><ATSHostname>:10200</value>
</property>
<property>
<name>yarn.timeline-service.hostname</name>
<value><ATSHostname></value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>3600000</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>6144</value>
<description>Amount of physical memory that can be allotted for containers.</description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/<local_directory>,/<local_directory></value>
</property>

Prepare and Start the Application Timeline Server


1. Log in to the machine where you want to start the Application Timeline Server.
2. Create a directory to store configuration files.
3. Copy the following files from the cluster and paste them into the directory that you created:
• core-site.xml
• yarn-site.xml
4. If the Cloudera cluster is Kerberos-enabled, open core-site.xml for editing and locate the
hadoop.security.authentication property. Change the value of the hadoop.security.authentication
property from Kerberos to simple. The resulting property looks like the following:
<property>
<name>hadoop.security.authentication</name>
<value>simple</value>
</property>
5. Copy the following files to the directory that you created in step 2:
• jackson-xc-1.9.2.jar
• jackson-jaxrs-1.9.2.jar
• jackson-core-asl-1.9.2.jar
• jackson-mapper-asl-1.9.12.jar
You can find the files in the following location:
/opt/cloudera/parcels/CDH<version>/lib/hadoop/libexec/../../hadoop/lib/
6. Set the following environment variables before you start the Application Timeline Server:
export YARN_USER_CLASSPATH_FIRST=true
export YARN_USER_CLASSPATH=<new_directory_path>:<new_directory_path>/*
where <new_directory_path> is the path to the directory that you created in step 2.
7. To start the Application Timeline Server, run the following command on any node in the Hadoop cluster:
sudo yarn timelineserver &

Enable the Blaze Engine Console


Enable the Blaze engine console in the hadoopEnv.properties file.

1. On the machine where the Data Integration Service runs, edit the hadoopEnv.properties file.
You can find hadoopEnv.properties in the following directory: <Informatica installation
directory>/services/shared/hadoop/<hadoop_distribution><version_number>/infaConf.
2. Set the infagrid.blaze.console.enabled property to true.
3. Save and close the hadoopEnv.properties file.
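For example, after you edit the file, the entry looks like the following line:
infagrid.blaze.console.enabled=true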

Configure HiveServer2 for DB2 Partitioning
To use database partitioning for DB2 with HiveServer2, use Cloudera Manager to configure the
LD_LIBRARY_PATH for HiveServer2.

Note: If the Hadoop cluster uses RPMs, you must manually edit the hive-env.sh file to add the
<DB2_HOME>/lib64 directory to LD_LIBRARY_PATH. You can find hive-env.sh in the following
directory: /etc/hive/conf

1. Open Cloudera Manager.


2. Click Hive > Configuration.
3. Search for the following property: HiveServer2 Environment Advanced Configuration Snippet.
4. Add the following directory to LD_LIBRARY_PATH: <DB2_HOME>/lib64.
5. Restart Hive services.
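For example, if you edit hive-env.sh directly on an RPM-based cluster, the following line is a minimal sketch of the change. The DB2 installation directory is a placeholder:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<DB2_HOME>/lib64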

Disable SQL Standard Based Authorization for HiveServer2


If the Hadoop cluster uses SQL standard or Sentry based authorization, you must disable them to run
mappings with HiveServer2.

1. Disable SQL standard based authorization and Sentry based authorization.


2. Restart the Hive services.

Configuring Big Data Management in the Hortonworks HDP Environment
You can enable Informatica mappings to run on a Hadoop cluster on Hortonworks HDP.

Informatica supports Hortonworks HDP clusters that are deployed on-premise, on Amazon EC2, or on
Microsoft Azure.

To enable Informatica mappings to run on a Hortonworks HDP cluster, complete the following steps:

Note: Skip the HiveServer2 related steps if you do not use HiveServer2 to run mappings.

1. Configure Hadoop Cluster Properties for the Data Integration Service.


2. Optionally, enable Tez.
3. Add hbase-protocol.jar to the Hadoop classpath.
4. Configure the Blaze engine.
5. Create the HiveServer2 environment variables and configure the HiveServer2 environment.
6. Disable SQL standard based authorization to run mappings with HiveServer2.
7. Enable support for HBase with HiveServer2.
8. Configure HiveServer2 for DB2 partitioning.

Configure Hadoop Cluster Properties for the Data Integration Service
Configure Hadoop cluster properties in the hive-site.xml, yarn-site.xml, and mapred-site.xml files that the Data
Integration Service uses when it runs mappings on a Hortonworks HDP cluster.

Configure hive-site.xml for the Data Integration Service


Note: You can skip this section if you do not use HiveServer2 to run mappings.

You need to configure the Hortonworks cluster properties in the hive-site.xml file that the Data Integration
Service uses when it runs mappings in a Hadoop cluster. If you use the Big Data Management Configuration
Utility to configure Big Data Management, the hive-site.xml file is automatically configured.

Open the hive-site.xml file in the following directory on the node on which the Data Integration Service
runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf/

Configure the following property in the hive-site.xml file:


hive.metastore.uris
URI for the metastore host.

The following sample text shows the property you can configure in the hive-site.xml file:
<property>
<name>hive.metastore.uris</name>
<value>thrift://hostname:port</value>
</property>

Configure yarn-site.xml for the Data Integration Service


You need to configure the Hortonworks cluster properties in the yarn-site.xml file that the Data Integration
Service uses when it runs mappings in a Hadoop cluster. If you use the Big Data Management Configuration
Utility to configure Big Data Management, the yarn-site.xml file is automatically configured.

Open the yarn-site.xml file in the following directory on the node on which the Data Integration Service
runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf/

Configure the following properties in the yarn-site.xml file:


yarn.resourcemanager.scheduler.address
Scheduler interface address.

Use the value in the following file: /etc/hadoop/conf/yarn-site.xml

yarn.resourcemanager.webapp.address
Web application address for the Resource Manager.

Use the value in the following file: /etc/hadoop/conf/yarn-site.xml.

If the cluster uses MapReduce 2, configure the following properties in the yarn-site.xml file:
mapreduce.jobhistory.address
Location of the MapReduce JobHistory Server. The default value is 10020.

Use the value in the following file: /etc/hadoop/<version>/0/mapred-site.xml

mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server. The default value is 19888.

Use the value in the following file: /etc/hadoop/<version>/0/mapred-site.xml

yarn.resourcemanager.scheduler.address
Scheduler interface address. The default value is 8030.

Use the value in the following file: /etc/hadoop/<version>/0/yarn-site.xml

yarn.resourcemanager.webapp.address
Resource Manager web application address.

Use the value in the following file: /etc/hadoop/<version>/0/yarn-site.xml

The following sample text shows the properties you can set in the yarn-site.xml file:
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hostname:port</value>
<description>The address of the scheduler interface</description>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hostname:port</value>
<description>The address for the Resource Manager web application.</description>
</property>

Configure mapred-site.xml for the Data Integration Service


You need to configure the Hortonworks cluster properties in the mapred-site.xml file that the Data
Integration Service uses when it runs mappings in a Hadoop cluster.

Open the mapred-site.xml file in the following directory on the node on which the Data Integration Service
runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf/

Configure the following properties in the mapred-site.xml file:


mapreduce.jobhistory.intermediate-done-dir
Directory where the MapReduce jobs write history files.

Use the value in the following file: /etc/hadoop/conf/mapred-site.xml

mapreduce.jobhistory.done-dir
Directory where the MapReduce JobHistory server manages history files.

Use the value in the following file: /etc/hadoop/conf/mapred-site.xml

The following sample text shows the properties you must set in the mapred-site.xml file:
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/mr-history/tmp</value>
<description>Directory where MapReduce jobs write history files.</description>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/mr-history/done</value>
<description>Directory where the MapReduce JobHistory server manages history
files.</description>
</property>

If you use the Big Data Management Configuration Utility to configure Big Data Management, the following
properties are automatically configured in mapred-site.xml. If you do not use the utility, configure the
following properties in mapred-site.xml:

mapreduce.jobhistory.address
Location of the MapReduce JobHistory Server.

Use the value in the following file:/etc/hadoop/conf/mapred-site.xml

mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server.

Use the value in the following file: /etc/hadoop/conf/mapred-site.xml

The following sample text shows the properties you can set in the mapred-site.xml file:
<property>
<name>mapreduce.jobhistory.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server IPC host:port</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server Web UI host:port</description>
</property>

Configure Rolling Upgrades for Hortonworks HDP


To enable support for rolling upgrades for Hortonworks HDP, you must configure the following properties in
mapred-site.xml on the machine where the Data Integration Service runs:
mapreduce.application.classpath
Classpaths for MapReduce applications.

Use the following value:


$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/
hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-
framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/
yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/
share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/
<hadoop_version>/hadoop/lib/hadoop-lzo-0.6.0.2.2.0.0-2041.jar:/etc/hadoop/conf/secure

Replace <hadoop_version> with your Hortonworks HDP version. For example, use 2.2.0.0-2041 for a
Hortonworks HDP 2.2 cluster.

mapreduce.application.framework.path
Path for the MapReduce framework archive.

Use the following value:


/hdp/apps/<hadoop_version>/mapreduce/mapreduce.tar.gz#mr-framework

Replace <hadoop_version> with your Hortonworks HDP version. For example, use 2.2.0.0-2041 for a
Hortonworks HDP 2.2 cluster.

The following sample text shows the properties you can set in the mapred-site.xml file:
<property>
<name>mapreduce.application.classpath</name>
<value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/
hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/
hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-
framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:
$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/<hadoop_version>/hadoop/lib/
hadoop-lzo-0.6.0.2.2.0.0-2041.jar:/etc/hadoop/conf/secure
</value>
<description>Classpaths for MapReduce applications. Replace <hadoop_version> with your
Hortonworks HDP version. For example, use 2.2.0.0-2041 for a Hortonworks HDP 2.2
cluster.

</description>
</property>
<property>
<name>mapreduce.application.framework.path</name>
<value>/hdp/apps/<hadoop_version>/mapreduce/mapreduce.tar.gz#mr-framework</value>
<description> Path for the MapReduce framework archive. Replace <hadoop_version> with
your Hortonworks HDP version. For example, use 2.2.0.0-2041 for a Hortonworks HDP 2.2
cluster.
</description>
</property>

Configure the Mapping Logic Pushdown Method


You can use MapReduce or Tez to push mapping logic to the Hadoop cluster. You enable MapReduce or Tez
for the Data Integration Service or for a connection.

When you enable MapReduce or Tez for the Data Integration Service, that execution engine becomes the
default execution engine to push mapping logic to the Hadoop cluster. When you enable MapReduce or Tez
for a connection, that engine takes precedence over the execution engine set for the Data Integration
Service.

Choose MapReduce or Tez as the Execution Engine for the Data Integration Service
To use MapReduce or Tez as the default execution engine to push mapping logic to the Hadoop cluster,
perform the following steps:

1. Open hive-site.xml in the following directory on the node on which the Data Integration Service runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/
conf/
2. Edit the hive.execution.engine property.
The following sample text shows the property in hive-site.xml:
<property>
<name>hive.execution.engine</name>
<value>tez</value>
<description>Chooses execution engine. Options are: mr (MapReduce, default) or tez
(Hadoop 2 only)</description>
</property>
Set the value of the property as follows:
• mr -- Sets MapReduce as the execution engine.
• tez -- Sets Tez as the execution engine.

Enable Tez for a Hadoop or Hive Connection


When you enable Tez for a connection, the Data Integration Service uses Tez to push mapping logic to the
Hadoop cluster regardless of what is set for the Data Integration Service.

1. Open the Developer tool.


2. Click Window > Preferences.
3. Select Informatica > Connections.
4. Expand the domain.
5. Expand the Databases and select the Hadoop or Hive connection.
6. Edit the connection and configure the Environment SQL property on the Database Connection tab.
Use the following value: set hive.execution.engine=tez;

If you enable Tez for the Data Integration Service but want to use MapReduce, you can use the following
value for the Environment SQL property: set hive.execution.engine=mr;.

Configure Tez
If you use Tez as the execution engine, you must configure properties in tez-site.xml.

You can find tez-site.xml in the following directory on the machine where the Data Integration Service
runs: <Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf.

Configure the following properties:


tez.lib.uris
Specifies the location of tez.tar.gz on the Hadoop cluster.

Use the value specified in tez-site.xml on the cluster. You can find tez-site.xml in the following
directory on any node in the cluster: /etc/tez/conf.

tez.am.launch.env
Specifies the location of Hadoop libraries.

Use the following syntax when you configure tez-site.xml:


<property>
<name>tez.lib.uris</name>
<value><file system default name>://<directory of tez.tar.gz></value>
<description>The location of tez.tar.gz. Set tez.lib.uris to point to the tar.gz
uploaded to HDFS.</description>
</property>
<property>
<name>tez.am.launch.env</name>
<value>LD_LIBRARY_PATH=<Hadoop installation directory>/<Hadoop version>/hadoop/lib/native</value>
<description>The location of Hadoop libraries.</description>
</property>

The following example shows the properties if tez.tar.gz is in the /hdp/apps/<version>/tez directory on HDFS:
<property>
<name>tez.lib.uris</name>
<value>hdfs://<Active_Name_Node>:8020/hdp/apps/<version>/tez/tez.tar.gz</value>
<description>The location of tez.tar.gz. Set tez.lib.uris to point to the tar.gz
uploaded to HDFS.</description>
</property>
<property>
<name>tez.am.launch.env</name>
<value>LD_LIBRARY_PATH=/usr/hdp/<hadoop_version>/hadoop/lib/native</value>
<description>The location of Hadoop libraries.</description>
</property>

Configure Tez for HiveServer2


If you use HiveServer2 to run mappings, open the tez-site.xml file. Verify that the following properties are
commented out:

• tez.am.launch.cmd-opts
• tez.task.launch.env
• tez.am.launch.env
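For example, a commented-out property in tez-site.xml looks like the following sketch. The value shown is a placeholder for illustration only:
<!--
<property>
  <name>tez.am.launch.env</name>
  <value>LD_LIBRARY_PATH=/usr/hdp/<hadoop_version>/hadoop/lib/native</value>
</property>
-->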

Add hbase-protocol.jar to the Hadoop classpath


Add a reference to the hbase-protocol.jar file to the Hadoop classpath on every node on the Hadoop
cluster.

Edit the Hadoop classpath on every node on the Hadoop cluster to point to the hbase-protocol.jar file.
Then, restart the Node Manager for each node in the Hadoop cluster.

hbase-protocol.jar is located in the HBase installation directory on the Hadoop cluster. For more
information, refer to the following link: https://issues.apache.org/jira/browse/HBASE-10304

HiveServer2 Configuration Tasks

When you use HiveServer2 to run mappings, perform the following configuration tasks.

Create the HiveServer2 Environment Variables


Before you can configure the HiveServer2 environment, create the required environment variables.

You can run the Big Data Management Configuration Utility and select HiveServer2 to generate the
HiveServer2_EnvInfa.txt file. Alternatively, you can modify a template to create the required environment
variables.

Modify the following template:


export LD_LIBRARY_PATH=/<HADOOP_NODE_INFA_HOME>/services/shared/
bin:<HADOOP_NODE_INFA_HOME>/services/shared/hadoop/<HADOOP_DISTRIBUTION>/lib/native:
$LD_LIBRARY_PATH
export INFA_HADOOP_DIST_DIR=<HADOOP_NODE_INFA_HOME>/services/shared/hadoop/
<HADOOP_DISTRIBUTION>
export INFA_PLUGINS_HOME=<HADOOP_NODE_INFA_HOME>/Informatica/plugins

export TMP_INFA_AUX_JARS=$INFA_HADOOP_DIST_DIR/infaLib/hadoop2.4.0-hdfs-native-impl.jar:
$INFA_HADOOP_DIST_DIR/infaLib/hadoop2.7.1.hw23-native-impl.jar:$INFA_HADOOP_DIST_DIR/
infaLib/hbase1.1.2-infa-plugins.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-
boot.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-plugins.jar:
$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-storagehandler.jar:$INFA_HADOOP_DIST_DIR/
infaLib/hive0.14.0-native-impl.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive1.1.0-
avro_complex_file.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive-infa-plugins-interface.jar:
$INFA_HADOOP_DIST_DIR/infaLib/infa-hadoop-hdfs.jar:$INFA_HADOOP_DIST_DIR/infaLib/
profiling-hive0.14.0-udf.jar:/opt/Informatica/infa_jars.jar

if [ "${HIVE_AUX_JARS_PATH}" != "" ]; then


export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH:$TMP_INFA_AUX_JARS
else
export HIVE_AUX_JARS_PATH=$TMP_INFA_AUX_JARS
fi

export JAVA_LIBRARY_PATH=<HADOOP_NODE_INFA_HOME>/services/shared/bin
export INFA_RESOURCES=<HADOOP_NODE_INFA_HOME>/Informatica/services/shared/bin
export INFA_HOME=<HADOOP_NODE_INFA_HOME>
export IMF_CPP_RESOURCE_PATH=<HADOOP_NODE_INFA_HOME>/Informatica/services/shared/bin
export
INFA_MAPRED_OSGI_CONFIG='osgi.framework.activeThreadType:false&:org.osgi.framework.stora
ge.clean:none&:eclipse.jobs.daemon:true&:infa.osgi.enable.workdir.reuse:true&:infa.osgi.
parent.workdir::/tmp/infa&:infa.osgi.workdir.poolsize:4'

Replace <HADOOP_NODE_INFA_HOME> with the Informatica installation directory on the Hadoop cluster.

Replace <HADOOP_DISTRIBUTION> with the Informatica Hadoop installation directory on the Hadoop
cluster. Based on your Hadoop distribution, use one of the following phrases to replace
<HADOOP_DISTRIBUTION>:

• biginsights_<version_number> for BigInsights.


• hortonworks_<version_number> for Hortonworks HDP.
• hdinsight_<version_number> for Azure HDInsight.

Note: If you use Ambari with CSH as the default shell, you must change the export command to set.
After you create the environment variables, configure the HiveServer2 environment with Ambari or the hive-
env.sh file.

Configure the HiveServer2 Environment with Ambari
After you create the HiveServer2 environment variables with the Big Data Management Configuration Utility
or the modified template, configure the HiveServer2 environment.

If you use the utility to select HiveServer2, you can find HiveServer2_EnvInfa.txt in the following directory
on the machine where the Data Integration Service runs: <Informatica installation directory>/tools/
BDMUtil.

1. Open the HiveServer2_EnvInfa.txt file or the modified template.


2. Copy the contents of the file.
Note: If you use Ambari with CSH as the default shell, you must change the export command to set.
3. Log in to Ambari.
4. Click Hive > Configs > Advanced.
5. Search for the "hive-env template" property.
6. Paste the contents of HiveServer2_EnvInfa.txt or the modified template.
7. Save the changes.
8. Restart the HiveServer2 services.

Configure the HiveServer2 Environment with hive-env.sh


After you create the HiveServer2 environment variables with the Big Data Management Configuration Utility
or the modified template, configure the HiveServer2 environment with the hive-env.sh file.

If you use the utility to select HiveServer2, you can find HiveServer2_EnvInfa.txt in the following directory
on the machine where the Data Integration Service runs: <Informatica installation directory>/tools/
BDMUtil.

1. Open the hive-env.sh file.


You can find hive-env.sh in the following directory: /etc/hive/conf/hive-env.sh.
2. Copy and paste the contents of HiveServer2_EnvInfa.txt or the modified template to the end of hive-
env.sh.
3. Restart HiveServer2 services.
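For example, the following command is a minimal sketch that appends the generated file to hive-env.sh. The file locations are assumptions based on the default paths described above:
cat <Informatica installation directory>/tools/BDMUtil/HiveServer2_EnvInfa.txt >> /etc/hive/conf/hive-env.sh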

Disable SQL Standard Based Authorization to Run Mappings with HiveServer2
If the Hadoop cluster uses SQL standard based authorization, you must disable it to run mappings with
HiveServer2.

1. Log in to Ambari.
2. Select Hive > Configs.
3. In the Security section, set Hive Security Authorization to None.
4. Navigate to the Advanced tab for hiveserver2-site.
5. Set Enable Authorization to false.
6. Restart Hive Services.

Enable Storage Based Authorization with HiveServer2
Optionally, you can use storage-based authorization with HiveServer2.

1. Log in to Ambari.
2. Click Hive > Configs.
3. In the Security section, set the Hive Security Authorization to SQLStdAuth.
4. Navigate to Advanced Configs.
5. In the General section, configure the following properties:
Hive Authorization Manager
Set this property to the following value:
org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider,org.apache.hadoop.hive.ql.security.authorization.MetaStoreAuthzAPIAuthorizerEmbedOnly
hive.security.authorization.manager
Set this property to the following value:
org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
By default, this property is set to the following value:
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory
Enable Authorization
Set this property to True.
6. In the Advanced hiveserver2-site section, configure the following properties:
Enable Authorization
Set this value to True.

hive.security.authorization.manager
Set this property to the following value:
org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
By default, this property is set to the following value:
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
7. Restart all Hive services.

Enable HBase Support


To use HBase as a source or target when you run a mapping in the Hadoop environment, you must add
hbase-site.xml to a distributed cache.

Perform the following steps:

1. On the machine where the Data Integration Service runs, go to the following directory: <Informatica
installation directory>/services/shared/hadoop/hortonworks_<version>/infaConf.
2. Edit hadoopEnv.properties.
3. Verify the HBase version specified in infapdo.env.entry.mapred_classpath uses the correct HBase
version for the Hadoop cluster.

The following sample text shows infapdo.env.entry.mapred_classpath for a Hadoop cluster that uses
HBase version 1.1.1.2.3.0.0-2504:
infapdo.env.entry.mapred_classpath=INFA_MAPRED_CLASSPATH=
$HADOOP_NODE_HADOOP_DIST/lib/hbase-server-1.1.1.2.3.0.0-2504.jar:
$HADOOP_NODE_HADOOP_DIST/lib/htrace-core.jar:$HADOOP_NODE_HADOOP_DIST/lib/htrace-
core-2.04.jar:$HADOOP_NODE_HADOOP_DIST/lib/protobuf-java-2.5.0.jar:
$HADOOP_NODE_HADOOP_DIST/lib/hbase-client-1.1.1.2.3.0.0-2504.jar:
$HADOOP_NODE_HADOOP_DIST/lib/hbase-common-1.1.1.2.3.0.0-2504.jar:
$HADOOP_NODE_HADOOP_DIST/lib/hive-hbase-handler-1.2.1.2.3.0.0-2504.jar:
$HADOOP_NODE_HADOOP_DIST/lib/hbase-protocol-1.1.1.2.3.0.0-2504.jar
4. Add the following entry to the infapdo.aux.jars.path variable: file://$DIS_HADOOP_DIST/conf/
hbase-site.xml.
The following sample text shows infapdo.aux.jars.path with the variable added:
infapdo.aux.jars.path=file://$DIS_HADOOP_DIST/infaLib/hive0.14.0-infa-
boot.jar,file://$DIS_HADOOP_DIST/infaLib/hive-infa-plugins-interface.jar,file://
$DIS_HADOOP_DIST/infaLib/profiling-hive0.13.0.hw21-udf.jar,file://$DIS_HADOOP_DIST/
infaLib/hadoop2.2.0-avro_complex_file.jar,file://$DIS_HADOOP_DIST/conf/hbase-site.xml
5. On the machine where the Data Integration Service runs, go to the following directory: <Informatica
installation directory>/services/shared/hadoop/hortonworks_<version>/conf.
6. In hbase-site.xml and hive-site.xml, verify that the zookeeper.znode.parent property exists and
matches the property set in hbase-site.xml on the cluster.
By default, the ZooKeeper directory on the cluster is /usr/hdp/current/hbase-client/conf.
7. On the machine where the Developer tool runs, go to the following directory: <Informatica installation
directory>\clients\DeveloperClient\hadoop\hortonworks_<version>/conf.
8. In hbase-site.xml and hive-site.xml, verify that the zookeeper.znode.parent property exists and
matches the property set in hbase-site.xml on the cluster.
By default, the ZooKeeper directory on the cluster is /usr/hdp/current/hbase-client/conf.
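For example, the zookeeper.znode.parent entry looks like the following sketch. The value /hbase-unsecure is the typical default for an unsecured cluster; use the value set in hbase-site.xml on your cluster:
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase-unsecure</value>
</property>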

Enable Support for HBase with HiveServer2


You must configure Big Data Management to run a mapping that uses an HBase source or target with
HiveServer2.

Perform the following steps:

1. Verify that the value for the zookeeper.znode.parent property in the hbase-site.xml file on the
machine where the Data Integration Service runs matches the value on the Hadoop cluster.
The default value is /hbase-unsecure.
You can find the hbase-site.xml file in the following directory on the machine where the Data
Integration Service runs: <Informatica installation directory>/services/shared/hadoop/<hadoop
distribution>/conf.
You can find the hbase-site.xml file in the following directory on the Hadoop cluster: <Informatica
installation directory>/services/shared/hadoop/<hadoop distribution>.
2. Verify that the infapdo.aux.jars.path property contains the path to the hbase-site.xml file.
The following sample text shows the infapdo.aux.jars.path property with the path for hbase-site.xml:
infapdo.aux.jars.path=file://$HADOOP_NODE_HADOOP_DIST/infaLib/hive0.14.0-infa-
boot.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/hive-infa-plugins-
interface.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/profiling-hive0.13.0.hw21-
udf.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/hadoop2.2.0-
avro_complex_file.jar,file://$HADOOP_NODE_HADOOP_DIST/conf/hbase-site.xml,file://
$HADOOP_NODE_HADOOP_DIST/infaLib/infa_jars.jar

Configure HiveServer2 for DB2 Partitioning
To use HiveServer2 with DB2 database partitioning, use Ambari to configure the
LD_LIBRARY_PATH for HiveServer2.

Note: If the Hadoop cluster uses RPMs, you must manually edit the hive-env.sh file to add the
<DB2_HOME>/lib64 directory to LD_LIBRARY_PATH. You can find hive-env.sh in the following
directory: /etc/hive/conf

1. Open Ambari.
2. Click Hive > Configs > Advanced.
3. Search for the hive-env template property.
4. Add the following directory to the LD_LIBRARY_PATH property: <DB2_HOME>/lib64.
5. Restart the Hive services.

Configure the Hadoop Cluster for the Blaze Engine


To use the Blaze engine, you must configure the Hadoop cluster.

Complete the following tasks:

• Configure Blaze on Kerberos-enabled clusters.


• Configure the yarn-site.xml file on every node in the Hadoop cluster.
• Start the Application Timeline Server.
• Enable the Blaze Engine console.

Configure Blaze on Kerberos-Enabled Clusters


To use the Blaze runtime engine on Kerberos-enabled clusters, perform the following steps:

1. Copy the following files from the cluster name node:


• core-site.xml
• hdfs-site.xml
2. Paste the files to the following directory on the VM that runs the Data Integration Service:
<InformaticaInstallationDir>/services/shared/hadoop/<HadoopDistributionName>/conf

Configure yarn-site.xml for the Application Timeline Server


You must configure properties in the yarn-site.xml file on every node in the Hadoop cluster to enable the
Application Timeline Server.

Configure the following properties:


yarn.timeline-service.webapp.address
The HTTP address for the Application Timeline service web application.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.enabled
Indicates whether the Timeline service is enabled.

Set this value to true.

yarn.timeline-service.address
Address for the Application Timeline Server to start the RPC server.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.hostname
The host name for the Application Timeline Service web application.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.ttl-ms
The time-to-live in milliseconds for data in the timeline store.

Use 3600000.

yarn.nodemanager.resource.memory-mb
Amount of physical memory that can be allotted for containers.

The Blaze engine requires at least 6144 MB.

yarn.nodemanager.local-dirs
List of directories to store localized files in.

The Blaze engine uses local directories for a distributed cache.

The following sample text shows the properties you configure in the yarn-site.xml file:
<property>
<name>yarn.timeline-service.webapp.address</name>
<value><ATSHostname>:8188</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.address</name>
<value><ATSHostname>:10200</value>
</property>
<property>
<name>yarn.timeline-service.hostname</name>
<value><ATSHostname></value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>3600000</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>6144</value>
<description>Amount of physical memory that can be allotted for containers.</description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/<local_directory>,/<local_directory></value>
</property>

Start the Hadoop Application Timeline Server
The Blaze engine uses the Hadoop Application Timeline Server to store the Job monitor status.

To start the Hadoop Application Timeline Server, run the following command on any node in the Hadoop
cluster:
sudo yarn timelineserver &

Enable the Blaze Engine Console


Enable the Blaze engine console in the hadoopEnv.properties file.

1. On the machine where the Data Integration Service runs, edit the hadoopEnv.properties file.
You can find hadoopEnv.properties in the following directory: <Informatica installation
directory>/services/shared/hadoop/<hadoop_distribution><version_number>/infaConf.
2. Set the infagrid.blaze.console.enabled property to true.
3. Save and close the hadoopEnv.properties file.

Update Cluster Configuration Settings


Perform the following steps to update the Hadoop cluster to enable support for Azure HDInsight.

Note: You can use the Ambari cluster configuration tool to view and edit cluster properties. After you change
property values, the Ambari tool displays the affected cluster components. Restart the affected components
for the changes to take effect.

1. Choose where scripts will be executed.


• Edit each of the script files in the following directory on each of the nodes where you installed Big
Data Management: <Informatica installation home>/services/shared/hadoop/
hortonworks_2.3/scripts to change the value of /bin/sh to /bin/bash.
• If you cannot edit the script files, change the default shell of the system from /bin/dash to /bin/bash,
so that /bin/sh defaults to /bin/bash.
Place scripts in the following path:
<Cluster_Installation_directory>/services/shared/hadoop/hortonworks_2.3/scripts
Note: Restart the affected components for the changes to take effect.
2. Set the value of the fs.defaultFS property to the HDFS location you want.
After you restart, the cluster will populate files in the HDFS location.
Optionally, you can reset fs.defaultFS to the wasb location, and restart the affected components again.
Note: If you are not able to reset the value of the fs.defaultFS property, you can manually copy the entire
folder structure from the wasb location to the local HDFS location.

Configuring Big Data Management in the IBM BigInsights Environment
You can enable Informatica mappings to run on a Hadoop cluster on IBM BigInsights.

You can use the following runtime engines to run mappings on BigInsights:

• Native runtime engine -- the Data Integration Service


• Blaze engine
• Hive CLI and HiveServer2

To enable Informatica mappings to run on an IBM BigInsights cluster, complete the following steps:

1. Provide a user account for the JDBC and Hive connections.


2. Configure support for Data Quality mappings.
3. Configure support for run-time engines.
• For HiveServer2, you configure environment variables and enable support for HBase.
• For Blaze, you perform configuration tasks.

User Account for the JDBC and Hive Connections


To use a Hive source in the native environment or to preview Hive data in the Developer tool, you must
provide a user name for the JDBC and Hive connections.

Provide an operating system user account that is present on all nodes when you configure the JDBC and
Hive connections in the Developer Tool.

To use an anonymous user with Hive sources in the native environment or Hive data preview, create an
operating system user account named "anonymous" that is present on all nodes. Use this user account when
you set the JDBC and Hive connection properties.
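For example, the following command is a minimal sketch that creates the account on one node. Run an equivalent command on every node in the cluster:
sudo useradd anonymous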

Enable Support for Data Quality Capabilities


To use Data Quality capabilities with a Hadoop cluster that runs IBM BigInsights, configure the
hadoopEnv.properties file.

You must add the following path to the infapdo.env.entry.mapred_classpath property in the
hadoopEnv.properties file: $HADOOP_NODE_INFA_HOME/services/shared/jars/shapp/*

The following sample text shows the infapdo.env.entry.mapred_classpath property with the
$HADOOP_NODE_INFA_HOME/services/shared/jars/shapp/* path:
infapdo.env.entry.mapred_classpath=INFA_MAPRED_CLASSPATH=$HADOOP_NODE_HADOOP_DIST/lib/*:
$HADOOP_NODE_HADOOP_DIST/lib/protobuf-java-2.5.0.jar:$HADOOP_NODE_HADOOP_DIST/lib/hbase-
client.jar:$HADOOP_NODE_HADOOP_DIST/lib/hbase-common.jar:$HADOOP_NODE_HADOOP_DIST/lib/
hive-hbase-handler.jar:$HADOOP_NODE_HADOOP_DIST/lib/hbase-protocol.jar:
$HADOOP_NODE_HADOOP_DIST/infaLib/*:$HADOOP_NODE_INFA_HOME/services/shared/jars/*:
$HADOOP_NODE_INFA_HOME/services/shared/jars/platform/*:$HADOOP_NODE_INFA_HOME/services/
shared/jars/platform/dtm/*:$HADOOP_NODE_INFA_HOME/services/shared/jars/thirdparty/*:
$HADOOP_NODE_HADOOP_DIST/infaLib/*:$HADOOP_NODE_INFA_HOME/plugins/infa/*:
$HADOOP_NODE_INFA_HOME/plugins/dynamic/*:$HADOOP_NODE_INFA_HOME/plugins/osgi/*:
$HADOOP_NODE_HADOOP_DIST/lib/htrace-core.jar:$HADOOP_NODE_INFA_HOME/services/shared/
jars/shapp/*:$HADOOP_NODE_HADOOP_DIST/lib/htrace-core-3.1.0-incubating.jar:
$HADOOP_CONF_DIR

You can find the hadoopEnv.properties file in the following directory on the machine where the Data
Integration Service runs: <Informatica installation directory>/services/shared/hadoop/
<Hadoop_distribution_name>/infaConf.

Create the HiveServer2 Environment Variables


Before you can configure the HiveServer2 environment, create the required environment variables.

You can run the Big Data Management Configuration Utility and select HiveServer2 to generate the
HiveServer2_EnvInfa.txt file. Alternatively, you can modify a template to create the required environment
variables.

Modify the following template:


export LD_LIBRARY_PATH=/<HADOOP_NODE_INFA_HOME>/services/shared/
bin:<HADOOP_NODE_INFA_HOME>/services/shared/hadoop/<HADOOP_DISTRIBUTION>/lib/native:
$LD_LIBRARY_PATH
export INFA_HADOOP_DIST_DIR=<HADOOP_NODE_INFA_HOME>/services/shared/hadoop/
<HADOOP_DISTRIBUTION>
export INFA_PLUGINS_HOME=<HADOOP_NODE_INFA_HOME>/Informatica/plugins

export TMP_INFA_AUX_JARS=$INFA_HADOOP_DIST_DIR/infaLib/hadoop2.4.0-hdfs-native-impl.jar:
$INFA_HADOOP_DIST_DIR/infaLib/hadoop2.7.1.hw23-native-impl.jar:$INFA_HADOOP_DIST_DIR/
infaLib/hbase1.1.2-infa-plugins.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-
boot.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-plugins.jar:
$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-storagehandler.jar:$INFA_HADOOP_DIST_DIR/
infaLib/hive0.14.0-native-impl.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive1.1.0-
avro_complex_file.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive-infa-plugins-interface.jar:
$INFA_HADOOP_DIST_DIR/infaLib/infa-hadoop-hdfs.jar:$INFA_HADOOP_DIST_DIR/infaLib/
profiling-hive0.14.0-udf.jar:/opt/Informatica/infa_jars.jar

if [ "${HIVE_AUX_JARS_PATH}" != "" ]; then


export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH:$TMP_INFA_AUX_JARS
else
export HIVE_AUX_JARS_PATH=$TMP_INFA_AUX_JARS
fi

export JAVA_LIBRARY_PATH=<HADOOP_NODE_INFA_HOME>/services/shared/bin
export INFA_RESOURCES=<HADOOP_NODE_INFA_HOME>/Informatica/services/shared/bin
export INFA_HOME=<HADOOP_NODE_INFA_HOME>
export IMF_CPP_RESOURCE_PATH=<HADOOP_NODE_INFA_HOME>/Informatica/services/shared/bin
export INFA_MAPRED_OSGI_CONFIG='osgi.framework.activeThreadType:false&:org.osgi.framework.storage.clean:none&:eclipse.jobs.daemon:true&:infa.osgi.enable.workdir.reuse:true&:infa.osgi.parent.workdir::/tmp/infa&:infa.osgi.workdir.poolsize:4'

Replace <HADOOP_NODE_INFA_HOME> with the Informatica installation directory on the Hadoop cluster.

Replace <HADOOP_DISTRIBUTION> with the name of the Hadoop distribution directory in the Informatica
installation on the Hadoop cluster. Based on your Hadoop distribution, use one of the following values to
replace <HADOOP_DISTRIBUTION>:

• biginsights_<version_number> for BigInsights.


• hortonworks_<version_number> for Hortonworks HDP.
• hdinsight_<version_number> for Azure HDInsight.

Note: If you use Ambari with CSH as the default shell, you must change the export command to set.

After you create the environment variables, configure the HiveServer2 environment with Ambari or the hive-
env.sh file.



Configure the HiveServer2 Environment with Ambari
After you create the HiveServer2 environment variables with the Big Data Management Configuration Utility
or the modified template, configure the HiveServer2 environment.

If you use the utility to select HiveServer2, you can find HiveServer2_EnvInfa.txt in the following directory
on the machine where the Data Integration Service runs: <Informatica installation directory>/tools/
BDMUtil.

1. Open the HiveServer2_EnvInfa.txt file or the modified template.


2. Copy the contents of the file.
Note: If you use Ambari with CSH as the default shell, you must change the export command to set.
3. Log in to Ambari.
4. Click Hive > Configs > Advanced.
5. Search for the "hive-env template" property.
6. Paste the contents of HiveServer2_EnvInfa.txt or the modified template.
7. Save the changes.
8. Restart the HiveServer2 services.

Configure the HiveServer2 Environment with hive-env.sh


After you create the HiveServer2 environment variables with the Big Data Management Configuration Utility
or the modified template, configure the HiveServer2 environment with the hive-env.sh file.

If you use the utility to select HiveServer2, you can find HiveServer2_EnvInfa.txt in the following directory
on the machine where the Data Integration Service runs: <Informatica installation directory>/tools/
BDMUtil.

1. Open the hive-env.sh file.


You can find hive-env.sh in the following directory: /etc/hive/conf/hive-env.sh.
2. Copy and paste the contents of HiveServer2_EnvInfa.txt or the modified template to the end of hive-
env.sh.
3. Restart HiveServer2 services.
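For example, the following commands show one way to append the generated environment variables and restart HiveServer2 from the command line. This is a sketch; the service name and restart method depend on your distribution and cluster manager, so use the restart procedure that your cluster documents.

cat HiveServer2_EnvInfa.txt >> /etc/hive/conf/hive-env.sh
# restart HiveServer2 with your cluster manager, for example:
sudo service hive-server2 restart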

Enable Support for HBase with HiveServer2


You must configure Big Data Management to run a mapping that uses an HBase source or target with
HiveServer2.

You must add the path for the hbase-site.xml file to the infapdo.aux.jars.path property in the
hadoopEnv.properties file.

The following sample text shows the infapdo.aux.jars.path property with the path for hbase-site.xml:
infapdo.aux.jars.path=file://$HADOOP_NODE_HADOOP_DIST/infaLib/hive0.14.0-infa-
boot.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/profiling-hive0.13.0-udf.jar,file://
$HADOOP_NODE_HADOOP_DIST/infaLib/hive-infa-plugins-interface.jar,file://
$HADOOP_NODE_INFA_HOME/infa_jars.jar,file://$HADOOP_NODE_HADOOP_DIST/conf/hbase-site.xml

You can find the hadoopEnv.properties file in the following directory on the machine where the Data
Integration Service runs: <Informatica installation directory>/services/shared/hadoop/
<Hadoop_distribution_name>/infaConf.



Configure the Hadoop Cluster for the Blaze Engine
To use the Blaze engine, you must configure the Hadoop cluster.

Complete the following tasks:

• Configure Blaze on Kerberos-enabled clusters.


• Configure the yarn-site.xml file on every node in the Hadoop cluster.
• Start the Application Timeline Server.
• Enable the Blaze Engine console.

Configure yarn-site.xml for the Application Timeline Server


You must configure properties in the yarn-site.xml file on every node in the Hadoop cluster to enable the
Application Timeline Server.

Configure the following properties:


yarn.timeline-service.webapp.address
The HTTP address for the Application Timeline service web application.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.enabled
Indicates whether the Timeline service is enabled.

Set this value to true.

yarn.timeline-service.address
Address for the Application Timeline Server to start the RPC server.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.hostname
The host name for the Application Timeline Service web application.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.ttl-ms
The time-to-live in milliseconds for data in the timeline store.

Use 3600000.

yarn.nodemanager.resource.memory-mb
Amount of physical memory that can be allotted for containers.

The Blaze engine requires at least 6144 MB.

yarn.nodemanager.local-dirs
List of directories to store localized files in.

The Blaze engine uses local directories for a distributed cache.

The following sample text shows the properties you configure in the yarn-site.xml file:
<property>
<name>yarn.timeline-service.webapp.address</name>
<value><ATSHostname>:8188</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>



<value>true</value>
</property>
<property>
<name>yarn.timeline-service.address</name>
<value><ATSHostname>:10200</value>
</property>
<property>
<name>yarn.timeline-service.hostname</name>
<value><ATSHostname></value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>3600000</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>6144</value>
<description>Amount of physical memory that can be allotted for containers.</
description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/<local_directory>,/<local_directory></value>
</property>

Start the Hadoop Application Timeline Server


The Blaze engine uses the Hadoop Application Timeline Server to store the Job monitor status.

To start the Hadoop Application Timeline Server, run the following command on any node in the Hadoop
cluster:
sudo yarn timelineserver &

Enable the Blaze Engine Console


Enable the Blaze engine console in the hadoopEnv.properties file.

1. On the machine where the Data Integration Service runs, edit the hadoopEnv.properties file.
You can find hadoopEnv.properties in the following directory: <Informatica installation
directory>/services/shared/hadoop/<hadoop_distribution><version_number>/infaConf.
2. Set the infagrid.blaze.console.enabled property to true.
3. Save and close the hadoopEnv.properties file.
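The following command shows one way to set the property from the command line. This is a sketch that assumes the property already exists in hadoopEnv.properties; if it does not, add the line infagrid.blaze.console.enabled=true to the file instead.

sed -i 's#^infagrid\.blaze\.console\.enabled=.*#infagrid.blaze.console.enabled=true#' hadoopEnv.properties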

Configuring Big Data Management in the MapR Environment
You can enable Informatica mappings to run on a Hadoop cluster on MapR.

Before you configure the Informatica domain and the MapR cluster to run mappings, download and install
EBF 17557. This EBF release supports MapR ticket and Kerberos-enabled MapR clusters.



To enable Informatica mappings to run on a MapR cluster that uses MapReduce 2, complete the following
tasks:

1. Verify the cluster details.


2. Download and install EBF 17557.
3. Configure the Informatica domain to communicate with a Kerberos-enabled MapR cluster, or a MapR
cluster that uses MapR ticket authentication.
4. Configure Hive and HDFS Metadata Fetch.
5. Configure environment variables in the Hadoop environment properties file.
6. Configure Hadoop cluster properties on the Data Integration Service machine for MapReduce 2.
7. Configure yarn-site.xml for MapReduce 2.
8. Edit warden.conf to configure heap space.
9. Configure the Blaze engine.
10. Configure the Developer tool.

Verify the Cluster Details


When you configure Big Data Management to run on a MapR cluster, verify the following cluster settings:

MapReduce Version
Verify that the cluster is configured for the correct version of MapReduce. You can use the MapR Control
System (MCS) to change the MapReduce version. Then, restart the cluster.

MapR User Details


Verify that the MapR user exists on each Hadoop cluster node and that the following properties are set
to identical values:

• User ID (uid)
• Group ID (gid)
• Groups

For example, MapR User details might be set to the following values:

• uid=2000(mapr)
• gid=2000(mapr)
• groups=2000(mapr)

Data Integration Service User Details


Verify that the user who runs the Data Integration Service is assigned the same gid as the MapR user
and belongs to the same group.

For example, a Data Integration Service user named testuser might have the following properties:

• uid=30103(testuser)
• gid=2000(mapr)
• groups=2000(mapr)
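For example, you can run the id command on each node and on the Data Integration Service machine to compare the values. The user names shown are the example values from this section.

id mapr       # uid=2000(mapr) gid=2000(mapr) groups=2000(mapr)
id testuser   # uid=30103(testuser) gid=2000(mapr) groups=2000(mapr)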

After you verify the Data Integration Service user details, perform the following steps on every node in
the cluster:

1. Use a tool such as PuTTY to connect to the node through SSH.



2. Use the Linux command line to create a cluster user that has the same user ID, group ID, and name
as the Data Integration Service user.
3. Add this user to all the nodes in the Hadoop cluster and assign it to the mapr group.
4. Verify that the user you created has read and write permissions for the following directory: /opt/
mapr/hive/hive-<version>/logs.
A directory corresponding to the user will be created at this location.
5. Verify that the user you created has permissions for the Hive warehouse directory.
The Hive warehouse directory is set in the following file: /opt/mapr/hive/hive-<version>/conf/
hive-site.xml.
For example, if the warehouse directory is /user/hive/warehouse, run the following command to
grant the user permissions for the directory:
hadoop fs -chmod -R 777 /user/hive/warehouse
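The following commands show one way to perform these steps on a cluster node, using the example values from this section. This is a sketch; adjust the user name, uid, and Hive version for your environment.

# create a cluster user that matches the Data Integration Service user and assign it to the mapr group
sudo useradd -u 30103 -g mapr testuser
# verify access to the Hive log directory and the Hive warehouse directory
ls -ld /opt/mapr/hive/hive-<version>/logs
hadoop fs -ls /user/hive/warehouse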

Install the EBF


Before you can configure Big Data Management 10.1 to enable mappings to run on a MapR cluster, you must
download and install EBF 17588 on top of Big Data Management 10.1.

Verify Prerequisites
Before you download and install the EBF, verify that you have the following environment:

• You can access a running Informatica domain that includes a Model Repository Service and a Data
Integration Service.
• You have the Developer client installed on a machine in your cluster.
• Informatica Big Data Management 10.1 RPM packages or Cloudera parcels are installed on your Hadoop
cluster.

EBF Installation
Install EBF 17588 on top of Informatica Big Data Management 10.1.

EBF 17588 contains the following archive files:


EBF17588_HadoopEBF_EBFInstaller.tar
This archive contains RPM archives that you need to update Informatica product binaries.

EBF17588_Server_Installer_linux_em64t.tar
This archive contains Linux updates for servers and the Big Data Management Configuration Utility.

EBF17588_Client_Installer_win_em64t.tar
This archive contains updates to clients, including the Developer tool.

INFORMATICA-10.1.0.informatica10.1.0.p1.364.parcel.tar
This archive contains updates to Big Data Management support for Cloudera clusters.

Contact Informatica Global Customer Support for the link to download and install EBF 17588. Then perform
the following tasks:

1. Download and uncompress the downloaded archive file.



2. Install the RPM binaries.
a. Edit the input.properties file with the following information:
• DEST_DIR - destination directory for Informatica Big Data Management.
Note: Install the binaries on each node of the cluster.
b. Type installEBF.sh to run the installer.
c. Accept the license terms.
d. Select 1 to install the package on a local cluster node.
After you install RPM binaries, perform the post-installation and configuration steps for your Hadoop
distribution. See the Big Data Management 10.1 Installation and Configuration Guide.
3. Install server updates and the updated Big Data Management Configuration Utility.
a. Edit the input.properties file with the following information:
• DEST_DIR - destination directory on the machine where the Data Integration Service
runs.
b. Type installEBF.sh to run the installer.
4. Update clients.
a. Edit the input.properties file with the following information:
• DEST_DIR - destination directory where clients are installed.
b. Type installEBF.sh to run the installer.
5. Copy .jar files to the /lib folder in the Informatica instance on each cluster node.
a. From a command prompt, cd to the following directory:
/opt/Informatica/services/shared/hadoop/<hadoop_distribution><version_number>/lib
b. Copy the following .jar files to the /lib folder:
• /usr/lib/hadoop/lib/emr-metrics-client-2.1.0.jar
• /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-2.6.0.jar
• /usr/share/aws/emr/emrfs/lib/emr-core-2.5.0.jar
• /usr/lib/hadoop/hadoop-common-2.7.2-amzn-1.jar
c. Copy the same .jar files to the following location on the Informatica domain:
<Big Data Management installation directory>/services/shared/hadoop/
<hadoop_distribution><version_number>/lib

Configure the Informatica Domain to Communicate with a Kerberos-Enabled MapR 5.1 Cluster
To configure the Informatica domain to enable mappings to run on a Kerberos-enabled MapR 5.1 cluster,
perform the following tasks:

1. Generate a MapR ticket.


2. Enable the TLS protocol and Kerberos.
3. Copy the Truststore file to the Data Integration Service machine.
4. Add additional properties for the runtime engine you want to use.
5. Configure Hive connection properties.
6. Edit hive-site.xml to add properties to enable mappings to use the Hive run-time engine.

Generate a MapR Ticket
To enable mappings to run on a Kerberos-enabled MapR cluster, generate a MapR ticket for the Data
Integration Service user.

1. Run the MapR kinit utility on the CLDB node of the cluster to create a Kerberos ticket for the Data
Integration Service user.
For information about how to generate MapR Tickets, refer to MapR documentation.
2. Run the maprlogin kerberos utility. Type:
maprlogin kerberos
The utility generates a MapR ticket in the /tmp directory using the following naming convention:
maprticket_<userid>
where <userid> corresponds to the Data Integration Service user.
3. Copy the ticket file from the cluster node to the following directory on the machine that runs the Data
Integration Service:
/tmp
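For example, the following commands show one possible sequence, assuming the Data Integration Service user has a Kerberos principal and SSH access between the hosts. The principal name and host name are placeholders.

kinit <DIS_user_principal>
maprlogin kerberos
scp /tmp/maprticket_<userid> <DIS_host>:/tmp/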

Enable the TLS Protocol and Kerberos


Enable the TLS security protocol in the Administrator tool.

1. In the Administrator tool, browse to the Data Integration Service Process Properties tab.
2. In the Advanced Properties area, add the following line to the JVM Command Line Options:
-Dhadoop.login=<MAPR_ECOSYSTEM_LOGIN_OPTS> -Dhttps.protocols=TLSv1.2
where <MAPR_ECOSYSTEM_LOGIN_OPTS> is the value of the MAPR_ECOSYSTEM_LOGIN_OPTS
property in the file /opt/mapr/conf/env.sh.
3. Restart the Data Integration Service for the change to take effect.
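For example, you can look up the value on a cluster node before you edit the JVM Command Line Options. The value hybrid shown below is only an example; use the value that appears in env.sh on your cluster.

grep MAPR_ECOSYSTEM_LOGIN_OPTS /opt/mapr/conf/env.sh
# if the value is "hybrid", the JVM Command Line Options entry becomes:
# -Dhadoop.login=hybrid -Dhttps.protocols=TLSv1.2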

Copy the Truststore File to the Data Integration Service Machine


1. Copy the truststore file from the following location on the cluster:
/opt/mapr/conf/ssl_truststore
2. Paste the truststore file to the following location on the Data Integration Service host:
<Informatica installation directory>/services/shared/hadoop/mapr_5.1.0/conf/
ssl_truststore
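For example, the following command copies the file over SSH. This is a sketch; <cluster_node> is a placeholder for any cluster node, and $INFA_HOME stands for the Informatica installation directory on the Data Integration Service host.

scp <cluster_node>:/opt/mapr/conf/ssl_truststore $INFA_HOME/services/shared/hadoop/mapr_5.1.0/conf/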

Configure Run-time Engines


You can run mappings in the Informatica native environment, or choose a run-time engine to run mappings in
the Hadoop environment.

When you choose the native run-time engine, Big Data Management uses the Data Integration Service to run
mappings on the Informatica domain. You can also choose a run-time engine to run mappings in the Hadoop
environment, which pushes mapping processing to the cluster.

When you want to run mappings on the cluster, you choose from the following run-time engines:
Blaze engine
The Blaze engine is an Informatica software component that can run mappings on the Hadoop cluster.



Spark engine
Spark is an Apache project that provides a run-time engine that can run mappings on the Hadoop
cluster.

Hive engine
When you run mappings on the Hive run-time engine, you choose Hive Command Line Interface or
HiveServer2.

Note: Hive Command Line Interface is commonly abbreviated Hive CLI.

Configure the Blaze Engine


To use the Blaze engine, you must configure the Hadoop cluster. Complete the following tasks:

• Configure the yarn-site.xml file on every node in the Hadoop cluster.


• Add additional properties for the Blaze runtime engine.
• Start the Application Timeline Server.
• Enable the Blaze Engine console.

Configure yarn-site.xml for the Application Timeline Server


You must configure properties in the yarn-site.xml file on every node in the Hadoop cluster to enable the
Application Timeline Server.

Configure the following properties:

yarn.timeline-service.webapp.address
The HTTP address for the Application Timeline service web application.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.enabled
Indicates whether the Timeline service is enabled.

Set this value to true.

yarn.timeline-service.address
Address for the Application Timeline Server to start the RPC server.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.hostname
The host name for the Application Timeline Service web application.

Use the host name of the machine that starts the Application Timeline Server for the host name.

yarn.timeline-service.ttl-ms
The time-to-live in milliseconds for data in the timeline store.

Use 3600000.

yarn.nodemanager.local-dirs
List of directories to store localized files in.

The Blaze engine uses local directories for a distributed cache.

The following sample text shows the properties you configure in the yarn-site.xml file:
<property>
<name>yarn.timeline-service.webapp.address</name>
<value><ATSHostname>:8188</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.address</name>
<value><ATSHostname>:10200</value>
</property>
<property>
<name>yarn.timeline-service.hostname</name>
<value><ATSHostname></value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>3600000</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/<local_directory>,/<local_directory></value>
</property>

Add Additional Properties for the Blaze Runtime Engine


To use the Blaze engine to run mappings, perform the following steps in the Administrator tool to define
custom properties:

1. In the Administrator tool, browse to the Data Integration Service Process tab.
2. In the environment variables area, define the Kerberos authentication protocol:

Property Value

JAVA_OPTS -Dhadoop.login=<MAPR_ECOSYSTEM_LOGIN_OPTS> -Dhttps.protocols=TLSv1.2


where <MAPR_ECOSYSTEM_LOGIN_OPTS> is the value of the MAPR_ECOSYSTEM_LOGIN_OPTS
property in the file /opt/mapr/conf/env.sh.

Start the Application Timeline Server


The Blaze engine uses the Hadoop Application Timeline Server to store the Job monitor status.

To start the Hadoop Application Timeline Server, run the following command on any node in the Hadoop
cluster:
sudo yarn timelineserver &



Enable the Blaze Engine Console
To enable the Blaze engine console in the hadoopEnv.properties file, perform the following steps:

1. On the machine where the Data Integration Service runs, edit the hadoopEnv.properties file.
You can find hadoopEnv.properties in the following directory: <Informatica installation
directory>/services/shared/hadoop/<hadoop_distribution><version_number>/infaConf.
2. Set the infagrid.blaze.console.enabled property to true.
3. Save and close the hadoopEnv.properties file.

Add Additional Properties for the Native Runtime Engine


To use the native Data Integration Service to run mappings, define custom properties in the Administrator
tool.

1. In the Administrator tool, browse to the Data Integration Service Process tab.
2. In the Custom Properties area, define the following properties and values:

Property Value

ExecutionContextOptions.JVMOption2 -Dhadoop.login=<MAPR_ECOSYSTEM_LOGIN_OPTS> -
Dhttps.protocols=TLSv1.2
where <MAPR_ECOSYSTEM_LOGIN_OPTS> is the value of the
MAPR_ECOSYSTEM_LOGIN_OPTS property in the file /opt/mapr/conf/
env.sh.

ExecutionContextOptions.JVMOption7 -Dhttps.protocols=TLSv1.2

Configure Hive Connection Properties


To use Hive to run mappings, you configure the Data Access Connection String with the service principal
name.

1. In the Administrator tool, browse to the Connections tab and browse to the HiveServer2 Connection
Properties area.
2. Configure the following connection properties:

Property Value

Metadata Connection String        jdbc:hive2://<domain_host>:<port_number>/default;principal=<service_principal_name>

Data Access Connection String     jdbc:hive2://<domain_host>:<port_number>/default;principal=<service_principal_name>

Note: You can retrieve the service principal name from the MapR Control System browser.

3. In the environment variables area, configure the following property to define the Kerberos authentication
protocol:

Property Value

JAVA_OPTS -Dhadoop.login=<MAPR_ECOSYSTEM_LOGIN_OPTS> -Dhttps.protocols=TLSv1.2


where <MAPR_ECOSYSTEM_LOGIN_OPTS> is the value of the MAPR_ECOSYSTEM_LOGIN_OPTS
property in the file /opt/mapr/conf/env.sh.

Edit hive-site.xml to Enable a Mapping to Run with the Hive Run-Time Engine
To run mappings using Hive, open the file <Informatica installation directory>/services/shared/
hadoop/mapr_<version>/conf/hive-site.xml for editing and make the following changes:

1. Locate the hive.server2.authentication property and change the value as follows:


<property>
<name>hive.server2.authentication</name>
<value>kerberos</value>
<description> </description>
</property>
2. Add the following property:
<property>
<name>hive.metastore.sasl.enabled</name>
<value>true</value>
<description> </description>
</property>
3. Open the file /opt/mapr/conf/hive-site.xml on the cluster and copy the following properties:
• hive-metastore.principal
• hive-metastore.keytab
Paste both properties to <Informatica installation directory>/services/shared/hadoop/
mapr_<version>/conf/hive-site.xml

Configure the Informatica Domain to Communicate with a Cluster that Uses MapR Ticket Authentication
You can use the following runtime engines to run mappings on MapR clusters that use the MapR Ticket
method of authentication:

• Native (Data Integration Service)


• Blaze
• Hive

To configure the Informatica domain to enable mappings to run on a MapR 5.1 cluster that uses MapR ticket
for authentication, perform the following steps:

1. Retrieve the value of the hive.server2.authentication property.


2. Open the following file for editing:
<Informatica Installation Directory>/services/shared/hadoop/mapr_5.1.0/conf/hive-
site.xml
3. Change the value of the hive.server2.authentication property from NONE to the
hive.server2.authentication property value that you obtained from the cluster.



4. Add the following property to the hive-site.xml file:
<property>
<name>hive.metastore.sasl.enabled</name>
<value>true</value>
</property>
Then save and close the hive-site.xml file.
5. Generate a MapR Ticket file for the Data Integration Service user on the cluster.
To understand how to generate MapR tickets, refer to MapR documentation.
6. Copy the MapR ticket file to the following directory on the machine that runs the Data Integration Service:
/tmp

Configure Hive and HDFS Metadata Fetch for MapR Ticket or Kerberos
Note: Before performing this task, complete the steps in the topic "Configure the Developer Tool" in the
Informatica 10.1 Big Data Management Installation and Configuration Guide.

To configure users for MapR Ticket or Kerberos-enabled MapR clusters, establish Linux accounts and
configure user permissions for users.

1. Create a Linux user on the node where the HiveServer2 service runs. Use the same username as the
Windows user account that runs the Developer tool client. This user is referred to as the client user.
2. If the cluster is Kerberos-enabled, you can perform the following steps to generate a MapR ticket.
Alternatively, follow steps 3 and 4.
a. Install maprclient on the Windows machine.
b. Generate a Kerberos ticket on the Windows machine.
c. Use maprlogin to generate a maprticket at %TEMP%.
Skip to step 5.
3. On the same node, log in as the client user and generate a MapR ticket.
Refer to MapR documentation for more information.
If the cluster is not Kerberos-enabled, follow these steps:
a. Type the following command:
maprlogin password
b. When prompted, provide the password for the client user.
If the cluster is Kerberos-enabled, follow these steps:
a. Generate a Kerberos ticket using kinit.
b. Type the following command to generate a maprticket:
maprlogin kerberos
The cluster generates a MapR ticket associated with the client user. By default, tickets on Linux systems
are generated in the /tmp directory and have a name like maprticket_<username>.
4. Copy the MapR ticket file and paste it to the %TEMP% directory on the Windows machine.
5. Rename the file like this:
maprticket_<username>
where <username> is the username of the client user.
6. On the MapR Control System browser, get the value of the property hive.server2.authentication.

7. Open the file <Informatica_client_installation>\clients\DeveloperClient\hadoop
\mapr_<version_number>\conf\hive-site.xml for editing.
8. Change the value of the property hive.server2.authentication from NONE to the value that you obtained in step 6.
Note: If Kerberos is enabled on the cluster, comment out the hive.server2.authentication property in
hive-site.xml.
9. Add the following lines to the hive-site.xml file:
<property>
<name>hive.metastore.sasl.enabled</name>
<value>true</value>
</property>
10. Save and close the hive-site.xml file.
To test the Hive connection, or perform a metadata fetch task, use the following format for the connection
string if the cluster is Kerberos-enabled:
jdbc:hive2://<hostname>:10000/default;principal=<SPN>

Example:
jdbc:hive2://myServer2:10000/default;principal=mapr/myServer2@clustername

If custom authentication is enabled, specify the user name and password in the Database Connection tab of
the Hive connection.

Note: When the mapping performs a metadata fetch of a complex file object, the user whose maprticket is
present at %TEMP% on the Windows machine must have read permission on the HDFS directory to list the
files inside it and perform the import action. The metadata fetch operation ignores privileges of the user who
is listed in the HDFS connection definition.

Running Mappings Using the Teradata Connector for Hadoop on a Hive or Blaze Engine
When you use Teradata Connector for Hadoop to run Teradata mappings on a Hive or Blaze engine, the
Data Integration Service user must have the MapR Ticket available on all the nodes of the Hadoop cluster.

Configure Environment Variables for MapR 5.1 in the Hadoop Environment Properties File
Configure MapR environment variables to use the MapR distribution to run mappings in a Hive environment.

Configure the following MapR variables:

• Add MAPR_HOME to the environment variables in the Data Integration Service Process properties. Set
MAPR_HOME to the following path: <Informatica installation directory>/services/shared/
hadoop/mapr_<version_number>/.
• Add -Dmapr.library.flatclass to the custom properties in the Data Integration Service Process properties.
For example, add
JVMOption1=-Dmapr.library.flatclass
• When you use the MapR distribution on the Linux operating system, change the environment variable
LD_LIBRARY_PATH to include the following path: <Informatica Installation Directory>/services/
shared/hadoop/mapr_<version>/lib/native/Linux-amd64-64:.:<Informatica Installation
Directory>/services/shared/bin.
• Add -Dmapr.library.flatclass to the Data Integration Service advanced property JVM Command Line
Options.



• Set the cluster name and the MapR Container Location Database (CLDB) node in the following file: <Informatica
installation directory>/services/shared/hadoop/mapr_<version>/conf/mapr-clusters.conf.
For example, add the following property:
INFAMAPR51 secure=false <master_node_name>:7222
• Copy the warden.conf file from /opt/mapr/ in the Hadoop cluster to the following path:
<Informatica installation directory>/services/shared/hadoop/mapr_<version>/conf

Configure Hadoop Cluster Properties on the Data Integration Service Machine for MapReduce 2
If the MapR cluster uses MapReduce 2, you must configure the Hadoop cluster properties in hive-site.xml
and yarn-site.xml on the machine where the Data Integration Service runs.

hive-site.xml and yarn-site.xml are located in the following directory on the machine where the Data
Integration Service runs: <Informatica installation directory>/services/shared/hadoop/
mapr_<version_number>_yarn/conf/.

In hive-site.xml, configure the following property:


yarn.app.mapreduce.am.staging-dir
Location of the staging directory for the Hadoop cluster.

The following sample code describes the property you can set in hive-site.xml:
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>maprfs:<staging directory path></value>
</property>

In yarn-site.xml, configure the following properties:


mapreduce.jobhistory.address
Location of the MapReduce JobHistory Server. The default port is 10020.

Use the value in the following file: /opt/mapr/hadoop/hadoop-<version_number>/etc/hadoop/mapred-site.xml

mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server. The default port is 19888.

Use the value in the following file: /opt/mapr/hadoop/hadoop-<version_number>/etc/hadoop/mapred-site.xml

yarn.resourcemanager.scheduler.address
Scheduler interface address. The default port is 8030.

Use the value in the following file: /opt/mapr/hadoop/hadoop-<version_number>/etc/hadoop/yarn-site.xml

yarn.resourcemanager.webapp.address
Resource Manager web application address.

Use the value in the following file: /opt/mapr/hadoop/hadoop-<version_number>/etc/hadoop/yarn-site.xml

The following sample code describes the properties you can set in yarn-site.xml:
<property>
<name>mapreduce.jobhistory.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server IPC host:port</description>

</property>

<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server Web UI host:port</description>
</property>

<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hostname:port</value>
<description>The address of the scheduler interface</description>
</property>

<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hostname:port</value>
<description>The address for the Resource Manager web application.</description>
</property>

Configure yarn-site.xml for MapReduce 2


Configure Hadoop cluster properties in the yarn-site.xml file to enable MapReduce2 on every node in the
Hadoop cluster.

yarn-site.xml is located in the following directory on the Hadoop cluster nodes: /opt/mapr/hadoop/
hadoop-<version>/etc/hadoop.

In yarn-site.xml, configure the following properties:

Note: If a property does not exist in yarn-site.xml, add it to the file.

yarn.nodemanager.resource.memory-mb
Amount of physical memory, in megabytes, that can be allocated for containers.

Use "24000" for the value.

yarn.scheduler.minimum-allocation-mb
The minimum allocation for every container request at the RM, in megabytes. Memory requests lower
than this value do not take effect, and the minimum value is allocated instead.

Use "2048" for the value.

yarn.scheduler.maximum-allocation-mb
The maximum allocation for every container request at the RM, in megabytes. Memory requests higher
than this do not take effect and are capped at this value.

Use "24000" for the value.

yarn.app.mapreduce.am.resource.mb
The amount of memory that the MR AppMaster needs.

Use "2048" for the value.

yarn.nodemanager.resource.cpu-vcores
Number of CPU cores that can be allocated for containers.

Use "8" for the value.

To use the Blaze engine, you must also configure additional properties in the yarn-site.xml file to enable
the Application Timeline Server.



The following sample code shows the properties you can configure in yarn-site.xml:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<description> Amount of physical memory, in MB, that can be allocated for containers.</
description>
<value>24000</value>
</property>

<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<description>The minimum allocation for every container request at the RM, in MBs.
Memory requests lower than this won't take effect, and the specified value will get
allocated at minimum.</description>
<value>2048</value>
</property>

<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<description> The maximum allocation for every container request at the RM, in MBs.
Memory requests higher than this won't take effect, and will get capped to this value.</
description>
<value>24000</value>
</property>

<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<description> The amount of memory the MR AppMaster needs.</description>
<value>2048</value>
</property>

<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<description> Number of CPU cores that can be allocated for containers. </description>
<value>8</value>
</property>

Edit warden.conf to Configure Heap Space


You must configure the heap space reserved for the MapR-FS on every node in the cluster.

1. Navigate to the following directory: /opt/mapr/conf.


2. Edit the warden.conf file.
3. Set the value for the service.command.mfs.heapsize.percent property to 20.
4. Save and close the file.
5. Repeat steps 1 through 4 for every node in the cluster.
6. Restart the cluster.
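For example, on each node you can verify the setting and then restart the MapR warden service. The restart command shown is one common approach; follow the restart procedure that MapR documents for your version.

grep heapsize.percent /opt/mapr/conf/warden.conf
# expected line after the edit: service.command.mfs.heapsize.percent=20
sudo service mapr-warden restart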

Configure the Developer Tool


To configure the Developer tool, perform the following steps:

1. Go to the following directory on any node in the Hadoop cluster: <MapR installation directory>/conf.
2. Find the mapr-clusters.conf file.
3. Copy the file to the following directory on the machine on which the Developer tool runs: <Informatica
installation directory>\clients\DeveloperClient\hadoop\mapr_<version_number>\conf
4. Go to the following directory on the machine on which the Developer tool runs: <Informatica
installation directory>\<version_number>\clients\DeveloperClient

5. Edit run.bat to include the MAPR_HOME environment variable and the -clean setting.
For example, include the following lines:
set MAPR_HOME=<Informatica installation directory>\clients\DeveloperClient\hadoop\mapr_510
developerCore.exe -clean
6. Save and close the file.
7. Add the following values to the developerCore.ini file:
-Dmapr.library.flatclass
-Djava.library.path=hadoop\mapr_<version_number>\lib\native\Win32;bin;..\DT\bin
You can find developerCore.ini in the following directory: <Informatica installation directory>
\clients\DeveloperClient
8. Save and close the file.
9. Use run.bat to start the Developer tool.



CHAPTER 4

High Availability
This chapter includes the following topics:

• Configure High Availability, 112


• Configuring Big Data Management for a Highly Available Cloudera CDH Cluster, 113
• Enable Support for a Highly Available Hortonworks HDP Cluster, 114
• Configuring Big Data Management for a Highly Available IBM BigInsights Cluster, 119
• Configuring Informatica for Highly Available MapR, 119

Configure High Availability


You can configure Big Data Management to read from and write to a highly available Hadoop cluster.

A highly available Hadoop cluster can provide uninterrupted access to the JobTracker, name node, and
ResourceManager in the cluster. The JobTracker is the service within Hadoop that assigns MapReduce jobs
on the cluster. The name node tracks file data across the cluster. The ResourceManager tracks resources
and schedules applications in the cluster.

You can configure Big Data Management to communicate with a highly available Hadoop cluster on the
following Hadoop distributions:

• Cloudera CDH
• Hortonworks HDP
• IBM BigInsights
• MapR

Configuring Big Data Management for a Highly
Available Cloudera CDH Cluster
You can configure the Data Integration Service and the Developer tool to read from and write to a highly
available Cloudera CDH cluster. The Cloudera CDH cluster provides a highly available name node and
ResourceManager.

1. Go to the following directory on the name node of the cluster:


/etc/hadoop/conf
2. Locate the following files:
• hdfs-site.xml
• yarn-site.xml
3. Note: If you use the Big Data Management Configuration Utility to configure Big Data Management, skip
this step.
Copy the files to the following directory on the machine where the Data Integration Service runs:
<Informatica installation directory>/services/shared/hadoop/cloudera_cdh<version>/conf
4. Copy the files to the following directory on the machine where the Developer tool runs:
<Informatica installation directory>/clients/DeveloperClient/Hadoop/
cloudera_cdh<version>/conf
5. Edit yarn-site.xml.
6. Find the yarn.application.classpath property.
7. Set the value to the classpath for YARN applications.
To find the classpath, run the following command:
yarn classpath
The following sample text shows yarn.application.classpath with a sample classpath:
<property>
<name>yarn.application.classpath</name>
<value>
/etc/hadoop/conf:/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hadoop/
libexec/../../hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/
hadoop/libexec/../../hadoop/.//*:/opt/cloudera/parcels/
CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hadoop/libexec/../../hadoop-hdfs/./:/opt/cloudera/
parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hadoop/libexec/../../hadoop-hdfs/lib/*:/opt/
cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hadoop/libexec/../../hadoop-
hdfs/.//*:/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hadoop/libexec/../../
hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hadoop/
libexec/../../hadoop-yarn/.//*:/opt/cloudera/parcels/CDH/lib/hadoop-
mapreduce/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/.//*:/opt/cloudera/
parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hadoop/libexec/../../hadoop-yarn/.//*:/opt/
cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hadoop/libexec/../../hadoop-
yarn/lib/*
</value>
</property>
8. Save and close yarn-site.xml.
9. Open the Developer tool.
10. Click Window > Preferences.
11. Select Informatica > Connections.
12. Expand the domain.
13. Expand Databases and select the Hive connection.

14. Edit the Hive connection and configure the following properties in the Properties to Run Mappings in
Hadoop Cluster tab:
Default FS URI
Use the value from the dfs.nameservices property in hdfs-site.xml.

Job tracker/Yarn Resource Manager URI


Enter any value in the following format: <string>:<port>. For example, enter dummy:12435.
15. Expand File Systems and select the HDFS connection.
16. Edit the HDFS connection and configure the following property in the Details tab:
Name Node URI
Use the value from the dfs.nameservices property in hdfs-site.xml.
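For example, if hdfs-site.xml defines dfs.nameservices as nameservice1, the Default FS URI and the Name Node URI typically take the form hdfs://nameservice1, with no port, so that the HDFS client can fail over between the active and standby name nodes. The name nameservice1 is only an example; use the value from your cluster.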

Enable Support for a Highly Available Hortonworks HDP Cluster
You can enable Data Integration Service and the Developer tool to read from and write to a highly available
Hortonworks cluster. The Hortonworks cluster provides a highly available name node and ResourceManager.

To enable support for a highly available Hortonworks HDP cluster, perform the following tasks:

1. Configure cluster properties for a highly available name node.


2. Configure cluster properties for a highly available ResourceManager.
3. Configure the connection to the cluster.

Configure Cluster Properties for a Highly Available Name Node


You must configure cluster properties in hive-site.xml to enable support for a highly available name node.

On the machine where the Data Integration Service runs, you can find hive-site.xml in the following
directory: <Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/
conf.

Configure the following properties in hive-site.xml:


dfs.ha.automatic-failover.enabled
This property determines whether automatic failover is enabled.

Set this value to true.

dfs.ha.namenodes.<ClusterName>
The ClusterName is specified in the dfs.nameservices property. The following sample text shows the
property for a cluster named cluster01: dfs.ha.namenodes.cluster01.

Specify the name node IDs with a comma separated list. For example, you can use the following values:
nn1,nn2.

dfs.namenode.https-address
The HTTPS server that the name node listens on.



dfs.namenode.https-address.<ClusterName>.<Name_NodeID>
The HTTPS server that a highly available name node specified in dfs.ha.namenodes.<ClusterName>
listens on. Each name node requires a separate entry. For example, if you have two highly available
name nodes, you must have two corresponding dfs.namenode.https-
address.<ClusterName>.<Name_NodeID> properties.

The following sample text shows a name node with the ID nn1 on a cluster named cluster01:
dfs.namenode.https-address.cluster01.nn1

dfs.namenode.http-address
The HTTP server that the name node listens on.

dfs.namenode.http-address.<ClusterName>.<Name_NodeID>
The HTTP server that a highly available name node specified in dfs.ha.namenodes.<ClusterName>
listens on. Each name node requires a separate entry. For example, if you have two highly available
name nodes, you must have two corresponding dfs.namenode.http-
address.<ClusterName>.<Name_NodeID> properties.

The following sample text shows a name node with the ID nn1 on a cluster named cluster01:
dfs.namenode.http-address.cluster01.nn1

dfs.namenode.rpc-address
The fully-qualified RPC address for the name node to listen on.

dfs.namenode.rpc-address.<ClusterName>.<Name_NodeID>
The fully-qualified RPC address that a highly available name node specified in
dfs.ha.namenodes.<ClusterName> listens on. Each name node requires a separate entry. For example,
if you have two highly available name nodes, you must have two corresponding dfs.namenode.rpc-
address.<ClusterName>.<Name_NodeID> properties.

The following sample text shows a name node with the ID nn1 on a cluster named cluster01:
dfs.namenode.rpc-address.cluster01.nn1.

The following sample text shows the properties for two highly available name nodes with the IDs nn1 and nn2
on a cluster named cluster01:
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>

<property>
<name>dfs.ha.namenodes.cluster01</name>
<value>nn1,nn2</value>
</property>

<property>
<name>dfs.namenode.https-address</name>
<value>node01.domain01.com:50470</value>
</property>

<property>
<name>dfs.namenode.https-address.cluster01.nn1</name>
<value>node01.domain01.com:50470</value>
</property>

<property>
<name>dfs.namenode.https-address.cluster01.nn2</name>
<value>node02.domain01.com:50470</value>
</property>

<property>
<name>dfs.namenode.http-address</name>



<value>node01.domain01.com:50070</value>
</property>

<property>
<name>dfs.namenode.http-address.cluster01.nn1</name>
<value>node01.domain01.com:50070</value>
</property>

<property>
<name>dfs.namenode.http-address.cluster01.nn2</name>
<value>node02.domain01.com:50070</value>
</property>

<property>
<name>dfs.namenode.rpc-address</name>
<value>node01.domain01.com:8020</value>
</property>

<property>
<name>dfs.namenode.rpc-address.cluster01.nn1</name>
<value>node01.domain01.com:8020</value>
</property>

<property>
<name>dfs.namenode.rpc-address.cluster01.nn2</name>
<value>node02.domain01.com:8020</value>
</property>

Configure Cluster Properties for a Highly Available Resource Manager
You must configure cluster properties in yarn-site.xml to enable support for a highly available Resource
Manager.

On the machine where the Data Integration Service runs, you can find yarn-site.xml in the following
directory: <Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/
conf.

Configure the following properties in yarn-site.xml:


yarn.resourcemanager.ha.enabled
This property determines whether high availability is enabled for Resource Managers.

Set this value to true.

yarn.resourcemanager.ha.rm-ids
List of highly available Resource Manager IDs.

For example, you can use the following values: rm1,rm2.

yarn.resourcemanager.hostname
The host name for the Resource Manager.

yarn.resourcemanager.hostname.<ResourceManagerID>
Host name for one of the highly available Resource Managers specified in
yarn.resourcemanager.ha.rm-ids.

Each Resource Manager requires a separate entry. For example, if you have two Resource Managers,
you must have two corresponding yarn.resourcemanager.hostname.<ResourceManagerID> properties.

The following sample text shows a Resource Manager with the ID rm1:
yarn.resourcemanager.hostname.rm1.



yarn.resourcemanager.webapp.address.<ResourceManagerID>
The HTTP address for the web application of one of the Resource Managers you specified in
yarn.resourcemanager.ha.rm-ids.

Each Resource Manager requires a separate entry.

yarn.resourcemanager.scheduler.address
The address of the scheduler interface.

yarn.resourcemanager.scheduler.address.<ResourceManagerID>
The address of the scheduler interface for one of the highly available Resource Managers.
Each resource manager requires a separate entry.

The following sample text shows the properties for two highly available Resource Managers with the IDs rm1
and rm2:
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>

<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>

<property>
<name>yarn.resourcemanager.hostname</name>
<value>node01.domain01.com</value>
</property>

<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>node01.domain01.com</value>
</property>

<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>node02.domain01.com</value>
</property>

<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>node01.domain01.com:8088</value>
</property>

<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>node01.domain01.com:8088</value>
</property>

<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>node02.domain01.com:8088</value>
</property>

<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>node01.domain01.com:8030</value>
</property>

<property>
<name>yarn.resourcemanager.scheduler.address.rm1</name>
<value>node01.domain01.com:8030</value>
</property>

<property>



<name>yarn.resourcemanager.scheduler.address.rm2</name>
<value>node02.domain01.com:8030</value>
</property>

Configuring Big Data Management for a Highly Available Hortonworks HDP Cluster
You can enable Data Integration Service and the Developer tool to read from and write to a highly available
Hortonworks cluster. The Hortonworks cluster provides a highly available NameNode and ResourceManager.

Perform the following steps:

1. Go to the following directory on the NameNode of the cluster:


/etc/hadoop/conf
2. Locate the following files:
• hdfs-site.xml
• yarn-site.xml
3. Note: If you use the Big Data Management Configuration Utility to configure Big Data Management, skip
this step.
Copy the files to the following directory on the machine where the Data Integration Service runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf
4. Edit hadoopEnv.properties and add the following path to the infapdo.aux.jars.path property:
file://$DIS_HADOOP_DIST/conf/hdfs-site.xml.
You can find hadoopEnv.properties in the following directory: <Informatica installation
directory>/services/shared/hadoop/hortonworks_<version>/infaConf/
5. Copy the files to the following directory on the machine where the Developer tool runs
<Informatica installation directory>/clients/DeveloperClient/Hadoop/
hortonworks_<version>/conf
6. Open the Developer tool.
7. Click Window > Preferences.
8. Select Informatica > Connections.
9. Expand the domain.
10. Expand Databases and select the Hive connection.
11. Edit the Hive connection and configure the following properties in the Properties to Run Mappings in
Hadoop Cluster tab:
Default FS URI
Use the value from the dfs.nameservices property in hdfs-site.xml.

Job tracker/Yarn Resource Manager URI


Enter any value in the following format: <string>:<port>. For example, enter dummy:12435.
12. Expand File Systems and select the HDFS connection.
13. Edit the HDFS connection and configure the following property in the Details tab:
NameNode URI
Use the value from the dfs.nameservices property in hdfs-site.xml.



Configuring Big Data Management for a Highly
Available IBM BigInsights Cluster
You can enable the Data Integration Service and the Developer tool to read from and write to a highly
available BigInsights cluster. The BigInsights cluster provides a highly available NameNode and
ResourceManager.

1. Go to the following directory on the NameNode of the cluster:


/data/ibm/biginsights/hadoop-conf
2. Locate the following files:
• hdfs-site.xml
• core-site.xml
3. Copy the files to the machine on which the Data Integration Service runs and the machine on which the
Developer tool runs:
On the machine on which the Data Integration Service runs, copy the files to the following directory:
<Informatica installation directory>/services/shared/hadoop/biginsights_<version>/conf
On the machine on which the Developer tool runs, copy the files to the following directory:
<Informatica installation directory>/clients/DeveloperClient/Hadoop/
biginsights_<version>/conf
4. On the machine where the Data Integration Service runs, edit core-site.xml.
5. Remove the following property: io.compression.codecs.
You can delete the property or change the property to a comment.
6. Open the Developer tool.
7. Click Window > Preferences.
8. Select Informatica > Connections.
9. Expand the domain.
10. Expand Databases and select the Hive Connection.
11. Edit the Hive connection and configure the following properties in the Properties to Run Mappings in
Hadoop Cluster tab:
Default FS URI
Use the value from the fs.defaultFS property found in core-site.xml.

JobTracker/Yarn Resource Manager URI


Use <cluster_namenode>:9001.

Configuring Informatica for Highly Available MapR


You can enable the Data Integration Service and the Developer tool to read from and write to a highly
available MapR cluster. The MapR cluster on MRv1 provides a highly available NameNode and JobTracker.

1. Go to the following directory on the NameNode of the cluster:


/opt/mapr/conf

2. Locate the mapr-clusters.conf file.
3. Copy the file to the machine on which the Data Integration Service runs and the machine on which the
Developer tool client runs:
On the machine on which the Data Integration Service runs, copy the file to the following directory:
<Informatica installation directory>/services/shared/hadoop/mapr_<version>/conf
On the machine on which the Developer tool runs, copy the file to the following directory:
<Informatica installation directory>/clients/DeveloperClient/Hadoop/mapr_<version>/conf
4. Open the Developer tool.
5. Click Window > Preferences.
6. Select Informatica > Connections.
7. Expand the domain.
8. Expand File Systems and select the HDFS connection.
9. Edit the HDFS connection and configure the following property in the Details tab:
NameNode URI
Use the value of the dfs.nameservices property.

You can get the value of the dfs.nameservices property from hdfs-site.xml from the following
location on the NameNode of the cluster: /etc/hadoop/conf



APPENDIX A

Upgrade Big Data Management


You can upgrade Big Data Management on the Hadoop cluster.

When you upgrade Big Data Management, you uninstall the previous Big Data Management RPMs and install
the new version.

Upgrading Big Data Management


You can upgrade Big Data Management. When you upgrade Big Data Management, you should back up the
configuration files before you start the upgrade process. After you upgrade, configure Big Data Management.

1. Verify that the Informatica domain and client tools are upgraded.
2. Uninstall the Big Data Management RPM package.
For more information about how to uninstall Big Data Management, see “Informatica Big Data
Management Uninstallation” on page 20
Note: If you used Cloudera Manager parcels to install Big Data Management, skip this step.
3. Install Big Data Management.
For more information about how to install Big Data Management, see “Installation Overview” on page 10
4. Configure Big Data Management.
Complete the tasks in “Post-Installation Overview” on page 22 and Chapter 3, “Configuring Big Data
Management to Run Mappings in Hadoop Environments” on page 41 for your Hadoop distribution.
5. Configure the Developer tool.
For more information, see “Enable Developer Tool Communication with the Hadoop Cluster ” on page 25
6. Optionally, configure Big Data Management to connect to a highly available Hadoop cluster.
For more information, see “Configure High Availability” on page 112

APPENDIX B

Configure Ports for Big Data Management
When you install and configure Big Data Management, the installer utility opens ports by default on domain
and cluster nodes. You must open other ports manually. This section lists the ports and the processes that
they serve.
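For example, on a Linux host that uses firewalld, you might open a single port with the following commands. This is only an illustration; the firewall tooling and the ports that you need to open depend on your environment and on the tables in this appendix.

sudo firewall-cmd --permanent --add-port=8188/tcp
sudo firewall-cmd --reload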

Informatica Domain and Application Services


The Informatica domain includes several services that perform important roles in data extraction and
processing.

For more information about application services, see the Informatica 10.1 Application Service Guide.

Application Services and Ports


Informatica domain services and application services in the Informatica domain have unique ports.

Informatica Domain
The following table lists the default port associated with the Informatica domain:

Type Default Port

Domain configuration Default is 6005. You can change the default port during installation. You can
modify the port after installation with the infasetup updateGatewayNode command.

Service Manager 6006

Service Manager Shutdown 6007

Informatica Administrator (HTTP) 6008

Informatica Administrator (HTTPS) 8443

Informatica Administrator shutdown 6009


Service Process (Minimum) 6013

Service Process (Maximum) 6113
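Before you install, you might verify that the default domain ports are free on the gateway node. This is a hedged sketch that uses common Linux tools; the port list mirrors the defaults in the preceding table:

    # Report any default domain port that is already bound on this node.
    for port in 6005 6006 6007 6008 6009 8443 6013; do
        ss -tln | grep -q ":${port} " && echo "Port ${port} is already in use"
    done

If a port is in use, choose a different port during installation or free the port before you install.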

Analyst Service
The following table lists the default ports associated with the Analyst Service:

Type Default Port

Analyst Service (HTTP) 8085

Analyst Service (HTTPS) No default port. Enter the required port number when you create the service.

Analyst Service (Staging database) No default port. Enter the database port number.

Content Management Service


The following table lists the default ports associated with the Content Management Service:

Type Default Port

Content Management Service (HTTP) 8105

Content Management Service (HTTPS) No default port. Enter the required port number when you create the service.

Data Director Service


The following table lists the default ports associated with the Data Director Service:

Type Default Port

Data Director Service (HTTP) No default port. Enter the required port number when you create the service.

Data Director Service (HTTPS) No default port. Enter the required port number when you create the service.

Data Integration Service


The following table lists the default ports associated with the Data Integration Service:

Type Default Port

Data Integration Service (HTTP proxy) 8085

Data Integration Service (HTTP) 8095

Data Integration Service (HTTPS) No default port. Enter the required port number when you create the service.


Profiling Warehouse database No default port. Enter the database port number.

Human Task database No default port. Enter the database port number.

Metadata Manager Service


The following table lists the default ports associated with the Metadata Manager Service:

Type Default Port

Metadata Manager Service (HTTP) Default is 10250.

Metadata Manager Service (HTTPS) No default port. Enter the required port number when you create the service.

PowerExchange Listener Service

Use the same port number that you specify in the SVCNODE statement of the DBMOVER file.

If you define more than one Listener Service to run on a node, you must define a unique SVCNODE port
number for each service.

PowerExchange Logger Service

Use the same port number that you specify in the SVCNODE statement of the DBMOVER file.

If you define more than one Logger Service to run on a node, you must define a unique SVCNODE port
number for each service.
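The following is a hedged sketch of a DBMOVER entry; the service name node1 and port 7111 are placeholder values, and you should confirm the exact SVCNODE syntax in the PowerExchange documentation for your release:

    /* Hypothetical example: the Listener Service or Logger Service named node1 uses port 7111.
    SVCNODE=(node1,7111)

Use the same port number when you create the corresponding application service in the Administrator tool.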

Web Services Hub Service


The following table lists the default ports associated with the Web Services Hub Service:

Type Default Port

Web Services Hub Service (HTTP) 7333

Web Services Hub Service (HTTPS) 7343



Big Data Management Ports
The following tables list the Hadoop components and default port numbers for each supported Hadoop distribution.

Cloudera 5.x
The following table lists the Cloudera Hadoop components and default port numbers:

Hadoop Component Port Notes

Cloudera Manager 7180

HBase master 60000

HBase master web 60010

HBase region server 60020

HDFS read/write 50010, 50020 Open these ports on all data nodes.

Hive metastore 9083

HiveServer 10000

JobTracker 8021

JobTracker web manager 50030

MapReduce Application Master 50100-50200

MapReduce JobHistory server 10020 YARN only

MapReduce JobHistory server webapp 19888 YARN only

MySQL 3306 Required if you use MySQL.

NameNode 8020

ResourceManager 8050 YARN only

ResourceManager webapp 8088 YARN only

ResourceTracker 8031 YARN only

Scheduler address 8030 YARN only

Shuffle HTTP 13562

TaskTracker web management 50060

ZooKeeper 2181
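To confirm that the Data Integration Service machine can reach these services, you can probe a few of the ports from that machine. This is a hedged sketch; the host names are placeholders for the hosts in your cluster:

    # Probe selected Cloudera ports from the Data Integration Service machine.
    # cm-host, nn-host, and hive-host are hypothetical host names.
    nc -zv cm-host.example.com 7180      # Cloudera Manager
    nc -zv nn-host.example.com 8020      # NameNode
    nc -zv hive-host.example.com 10000   # HiveServer

The same approach applies to the other distributions; substitute the ports from the corresponding table.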



Hortonworks 2.x
The following table lists the Hortonworks Hadoop components and default port numbers:

Hadoop Component Port Notes

HBase master 60000

HBase master web 60010

HBase region server 60020

HDFS read/write 50010, 50020 Open these ports on all data nodes.

Hive metastore 9933

HiveServer 10000

JobTracker 8021

JobTracker web manager 50030

MapReduce Application Master 50100-50200

MapReduce JobHistory server 10020 YARN only

MapReduce JobHistory server webapp 19888 YARN only

MySQL 3306 Required if you use MySQL.

NameNode 8020

ResourceManager 8032 YARN only

ResourceManager webapp 8088 YARN only

ResourceTracker 8031 YARN only

Scheduler address 8030 YARN only

Shuffle HTTP 13562

TaskTracker web management 50060

ZooKeeper 2181

IBM BigInsights 3.x


The following table lists the IBM BigInsights Hadoop components and default port numbers:

Hadoop Component Port Notes

HBase master 60000

HBase master web 60010


HBase region server 60020

HDFS read/write 50010, 50020 Open these ports on all data nodes.

Hive metastore 9933

HiveServer 10000

JobTracker 9001

JobTracker web manager 50030

MapReduce Application Master 50100-50200

MapReduce JobHistory server 10020 YARN only

MapReduce JobHistory server webapp 19888 YARN only

MySQL 3306 Required if you use MySQL.

NameNode 9000

ResourceManager 8032 YARN only

ResourceManager webapp 8088 YARN only

ResourceTracker 8031 YARN only

Scheduler address 8030 YARN only

Shuffle HTTP 13562

TaskTracker web management 50060

ZooKeeper 2181

MapR 5.x
The following table lists the MapR Hadoop components and default port numbers:

Hadoop Component Port Notes

CLDB 7222

CLDB JMX monitor 7220

CLDB web port 7221

HBase master 60000

HBase master web 60010

HBase region server 60020


Hive metastore 9083

HiveServer 10000

JobTracker 9001

JobTracker web manager 50030

MapReduce Application Master 50100-50200

MapReduce JobHistory server 10020 YARN only

MapReduce JobHistory server webapp 19888 YARN only

MySQL 3306 Required if you use MySQL.

NameNode 8020

ResourceManager 8032 YARN only

ResourceManager webapp 8088 YARN only

ResourceTracker 8031 YARN only

Scheduler address 8030 YARN only

Shuffle HTTP 13562

TaskTracker web management 50060

ZooKeeper 5181

Ports for the Blaze Engine


The Blaze engine uses Blaze services and the Blaze Job Monitor.

Blaze Job Monitor


The following table lists the ports you can configure for the Blaze Job Monitor:

Description Port Number

HTTP 9080

JSF 9090

Blaze Services
Blaze services include Grid Manager, Orchestrator, the DEF Client, the DEF Daemon, the OOP Container
manager, and the OOP Container.

The Blaze Grid Manager looks for the configured minimum and maximum ports in the Hadoop connection, and then starts
services on available ports from the specified range. The default port range is 12300 to 12600. An administrator
can configure a different range.
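If a firewall runs on the cluster nodes, the configured range must also be open between the nodes. The following is a minimal sketch for a node that runs firewalld, assuming the default range of 12300 to 12600; substitute the range that is set in the Hadoop connection if it differs:

    # Open the Blaze port range on a cluster node that runs firewalld.
    # 12300-12600 is the default range; use the range from the Hadoop connection.
    sudo firewall-cmd --permanent --add-port=12300-12600/tcp
    sudo firewall-cmd --reload

Repeat the commands on each node that runs Blaze services.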



Informatica Developer Ports
Use the Developer tool to build mappings and other objects that access, transform, and write data to and from big
data sources.

The following table lists the ports that the Developer tool installer opens:

Hadoop Component Port Notes

CLDB 7222 MapR only

HBase master 60000

HBase region server 60020

HiveServer2 thrift 10000

NameNode RPC 8020 All distributions except MapR

ZooKeeper client 2181, 5181



Index

A
Amazon EMR
  configuring mappings 43

B
Big Data Management
  Blaze
    configuration 36
  cluster installation 11, 15, 18
  cluster pre-installation tasks 12
  Data Quality 23
  HiveServer 2
    configuration 86
  single node installation 11, 14, 18
  single node pre-installation tasks 12

C
Cloudera
  creating a staging directory on HDFS 75
  Hadoop cluster properties 74
  mapping configuration 73
cluster installation
  any machine 16, 19
  primary NameNode 15, 18
connections
  HBase 65
  HDFS 65
  Hive 65
  JDBC 65

D
Data Quality
  address reference data files 23
  reference data 23
Data Replication
  installation and configuration 12

H
Hadoop 65
Hadoop distributions
  Amazon EMR 43
  Cloudera 73
  configuration tasks 41
  configuring virtual memory limits 76
  Developer tool file 25, 45, 50
  HDInsight 47
  Hortonworks 56, 80
  MapR 97
  staging directory on HDFS 75
HBase connections
  properties 67
HDFS connections
  properties 66
HDInsight
  configuring mappings 47
high availability
  NameNode 113
  ResourceManager 113
Hive connections
  properties 68
Hortonworks
  configuring mappings 56, 80

I
Informatica adapters
  installation and configuration 11
Informatica clients
  installation and configuration 11
Informatica services
  installation and configuration 11

M
mappings in a Hadoop environment
  Hive variables 42
mappings in a Hive environment
  library path 43
  path environment variables 43
MapR
  configuring mappings 97

N
NameNode
  high availability 113

P
primary NameNode
  FTP protocol 16, 18
  HTTP protocol 16, 18
  NFS protocol 16, 18
  SCP protocol 15, 18

R
ResourceManager
  high availability 113

S
Sqoop configuration
  copying JDBC driver jar files 33

V
vcore 36
