
Round-trip Data Enrichment between Teradata and Hadoop


Hadoop can be a great complement to existing data warehouse platforms, such as Teradata, as it naturally helps to address key storage challenges.

The purpose of this article is to detail some of the key integration points and to show how data can be easily exchanged for enrichment between the two platforms.

As a data integrator who is familiar with RDBMS systems and is new to the Hadoop platform, I was looking for a simple way (i.e. “SQL-way”) to exchange data with Teradata. Fortunately, it was just a matter of identifying the tools and connecting the dots.

Using pre-built adapters (Teradata Connector for Hadoop, Hortonworks Teradata Connector), SQL-based protocols for data exchange (Sqoop and Hive) and a GUI interface for Apache Hadoop (HUE, Oozie), simple workflows can be created to deliver big value.

First, let's review the key components:

  • Teradata Connector for Hadoop (TDCH). Teradata’s adapter for moving data to/from Hadoop, including data-type mapping, parallel processing capabilities and native loading options.
  • Hortonworks Teradata Connector (HTC). A Sqoop wrapper for TDCH that implements standard data exchange syntax for Hadoop.
  • Apache Sqoop. A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Apache Hive. An Apache project that extends a logical SQL structure over HDFS and supports query using a language called HiveQL.
  • Apache Oozie. A workflow scheduler system to manage Apache Hadoop jobs.

Although the parts are many, the majority of work was performed within HUE, which provided XML and script generation as well as a graphic test environment and logging capabilities.

Step 1:  Setup

The framework that will co-ordinate the data movement and manipulation process is the Oozie workflow engine, so it is here that the configuration begins.

An Oozie workflow consists of an application (deployment) folder and configuration files (workflow.xml and job.properties) that provide the processing instructions and variable definitions for the workflow job.

In HDFS, the application folder td_oozie_demo and its lib sub-directory were created by the hue user account.

# su - hue
# hadoop fs -mkdir td_oozie_demo
# hadoop fs -mkdir td_oozie_demo/lib

Supporting JAR files for each of the technologies we will leverage can be placed in the “lib” sub-directory.

Below is the list of supporting JAR files, with their source locations shown as comments (an example copy command follows the list):

# /usr/lib/sqoop/lib
/td_oozie_demo/lib/avro-1.5.3.jar
/td_oozie_demo/lib/avro-mapred-1.5.3.jar
/td_oozie_demo/lib/commons-io-1.4.jar
/td_oozie_demo/lib/snappy-java-1.0.3.2.jar
/td_oozie_demo/lib/sqoop-1.4.4.2.0.6.0-76.jar
# /usr/lib/hive/lib
/td_oozie_demo/lib/antlr-runtime-3.4.jar
/td_oozie_demo/lib/datanucleus-api-jdo-3.2.1.jar
/td_oozie_demo/lib/datanucleus-core-3.2.2.jar
/td_oozie_demo/lib/derby-10.4.2.0.jar
/td_oozie_demo/lib/hive-cli-0.12.0.2.0.6.0-76.jar
/td_oozie_demo/lib/hive-common-0.12.0.2.0.6.0-76.jar
/td_oozie_demo/lib/hive-exec-0.12.0.2.0.6.0-76.jar
/td_oozie_demo/lib/hive-metastore-0.12.0.2.0.6.0-76.jar
/td_oozie_demo/lib/hive-serde-0.12.0.2.0.6.0-76.jar
/td_oozie_demo/lib/hive-service-0.12.0.2.0.6.0-76.jar
/td_oozie_demo/lib/jackson-core-asl-1.7.3.jar
/td_oozie_demo/lib/jdo-api-3.0.1.jar
/td_oozie_demo/lib/jline-0.9.94.jar
/td_oozie_demo/lib/jopt-simple-3.2.jar
/td_oozie_demo/lib/libfb303-0.9.0.jar
/td_oozie_demo/lib/opencsv-2.3.jar
# https://downloads.teradata.com/download/connectivity/teradata-connector-for-hadoop-sqoop-integration-edition
/td_oozie_demo/lib/tdgssconfig.jar
/td_oozie_demo/lib/teradata-connector-1.1.1-hadoop200.jar
/td_oozie_demo/lib/terajdbc4.jar
 # http://hortonworks.com/products/hdp-2/#add_ons
/td_oozie_demo/lib/hortonworks-teradata-connector-1.1.1.2.0.6.1-101.jar
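
The JAR files can be staged into HDFS with hadoop fs -put; for example (local source paths and version numbers will vary with your installation):

# hadoop fs -put /usr/lib/sqoop/lib/sqoop-1.4.4.2.0.6.0-76.jar /td_oozie_demo/lib/
# hadoop fs -put /tmp/teradata-connector-1.1.1-hadoop200.jar /td_oozie_demo/lib/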

One more key requirement is to copy the hive-site.xml file to the application folder.  This file provides the mapping between the Oozie workflow and the HCatalog metadata store for Hive.

# hadoop fs -put /usr/lib/hive/conf/hive-site.xml td_oozie_demo/hive-config.xml

Once the initial setup is in place, the remaining steps can be performed in the HUE graphic interface.   Note that the workflow.xml file mentioned earlier will be auto-generated during “Step 3:  Process Definition”.

Step 2:  Configuration

In a default HDP 2.0 installation, the HUE interface can be reached from the following URL: http://<hostname or ip>:8000

From the HUE website, browse to the Oozie Editor and create a new Workflow:


Edit the Workflow properties to specify the application directory – td_oozie_demo – and configuration file – hive-config.xml.


Step 3:  Process Definition

The Oozie Editor supports several types of workflow “actions” that can be coordinated; for this example we will focus on two types: Hive and Sqoop.

The workflow will consist of the following process steps:

  1. Hive:  Create the target Hive table that will accept import from Teradata
  2. Sqoop:  Perform the import from Teradata to Hive
  3. Hive:  Generate an aggregated result set that enriches imported Teradata records with existing Hive/Hadoop data
  4. Sqoop:  Perform an export of the Hive result set to Teradata

Process step 1 will run the following SQL script – hive_script1.sql:

DROP TABLE IF EXISTS CLIENT;

CREATE TABLE CLIENT (
	C_CUSTKEY INT,
	C_NAME STRING,
	C_ADDRESS STRING,
	C_NATIONKEY INT,
	C_PHONE STRING,
	C_ACCTBAL DOUBLE,
	C_MKTSEGMENT STRING,
	C_COMMENT STRING
)
ROW FORMAT DELIMITED
	FIELDS TERMINATED BY '\t';

Using the workflow editor, add a “Hive” action and upload the hive_script1.sql file:


Process step 2 will perform a Sqoop import from the Teradata table “CLIENT” to the Hive table “client” that was created in the previous step.

Using the workflow editor, add a “Sqoop” action and specify the IMPORT command:

import --connect jdbc:teradata://<HOSTNAME OR IP ADDRESS>/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --hive-import --username dbc --password dbc --table CLIENT --hive-table client

Note:  This example connects to the Teradata Studio Express 14 VM available from Teradata.

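Before moving on, the import can be spot-checked from the Hive CLI; for example, a quick row count against the new table (assuming the hive client is installed on the node):

hive -e "SELECT COUNT(*) FROM client;"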

Process step 3 will run the following SQL script – hive_script2.sql – that generates an aggregated result set from the import table (client) and existing tables (contract, item):

DROP TABLE IF EXISTS CUST_BY_YEAR;

CREATE TABLE IF NOT EXISTS CUST_BY_YEAR(
	C_CUSTKEY INT,
	C_NAME STRING,
	C_YEAR INT,
	C_TOTAL_BY_YEAR DOUBLE
)
ROW FORMAT DELIMITED
	FIELDS TERMINATED BY '\t';
  	
INSERT OVERWRITE TABLE CUST_BY_YEAR 
select c.c_custkey, c.c_name, substr(o_orderdate,0,4) AS o_year, 
sum(l_extendedprice) AS o_year_total
FROM client c JOIN
contract o ON
c.c_custkey = o.o_custkey
JOIN
item l ON
o.o_orderkey = l.l_orderkey
GROUP BY
c.c_custkey, c.c_name, substr(o_orderdate,0,4);

Using the workflow editor, add a “Hive” action and upload the hive_script2.sql file:


Process step 4 will perform a Sqoop export from the Hive table cust_by_year to the Teradata table HDP_IMPORT.CUST_BY_YEAR.

Using the workflow editor, add a “Sqoop” action and specify the EXPORT command:

export --connect jdbc:teradata://<HOST OR IP ADDRESS>/Database=HDP_IMPORT --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --table CUST_BY_YEAR --export-dir /apps/hive/warehouse/cust_by_year --input-fields-terminated-by "\t"

Note:  The DDL for the target Teradata table CUST_BY_YEAR is:

CREATE SET TABLE HDP_IMPORT.CUST_BY_YEAR 
(
C_CUSTKEY INTEGER NOT NULL,
C_NAME VARCHAR(25) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
C_YEAR INTEGER NOT NULL,
C_YEARTOTAL DECIMAL(15,2) NOT NULL);


Upon completion of the workflow definition, the following workflow.xml is auto-generated in the Oozie application folder:

	<workflow-app name="Teradata Oozie Demo" xmlns="uri:oozie:workflow:0.4">
		<global>
			<job-xml>/td_oozie_demo/hive-config.xml</job-xml>
		</global>
		<start to="create_hive_target"/>
			<action name="create_hive_target">
				<hive xmlns="uri:oozie:hive-action:0.2">
					<job-tracker>${jobTracker}</job-tracker>
					<name-node>${nameNode}</name-node>
					<script>/td_oozie_demo/hive_script1.sql</script>
				</hive>
				<ok to="td_to_hive"/>
				<error to="kill"/>
			</action>
			<action name="td_to_hive">
				<sqoop xmlns="uri:oozie:sqoop-action:0.2">
					<job-tracker>${jobTracker}</job-tracker>
					<name-node>${nameNode}</name-node>
<command>import --connect jdbc:teradata://192.168.1.13/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --hive-import --username dbc --password dbc --table CLIENT --hive-table client</command>
				</sqoop>
				<ok to="populate_hive_results"/>
				<error to="kill"/>
			</action>
			<action name="populate_hive_results">
				<hive xmlns="uri:oozie:hive-action:0.2">
					<job-tracker>${jobTracker}</job-tracker>
					<name-node>${nameNode}</name-node>
					<script>/td_oozie_demo/hive_script2.sql</script>
				</hive>
				<ok to="hive_to_td"/>
				<error to="kill"/>
			</action>
			<action name="hive_to_td">
				<sqoop xmlns="uri:oozie:sqoop-action:0.2">
					<job-tracker>${jobTracker}</job-tracker>
					<name-node>${nameNode}</name-node>
<command>export --connect jdbc:teradata://192.168.1.13/Database=HDP_IMPORT --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --table CUST_BY_YEAR --export-dir /apps/hive/warehouse/cust_by_year --input-fields-terminated-by &quot;\t&quot;</command>
				</sqoop>
				<ok to="end"/>
				<error to="kill"/>
			</action>
			<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
			</kill>
			<end name="end"/>
		</workflow-app>
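
For reference, the job.properties file mentioned in Step 1 supplies the ${jobTracker} and ${nameNode} variables used above. If the workflow is submitted from the Oozie command line rather than HUE, a minimal sketch might look like this (hostnames and ports are placeholders for your cluster):

nameNode=hdfs://<namenode-host>:8020
jobTracker=<resourcemanager-host>:8050
oozie.wf.application.path=${nameNode}/td_oozie_demo

# submit and start the workflow from the Oozie CLI
oozie job -oozie http://<oozie-host>:11000/oozie -config job.properties -run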

Step 4:  Testing and Execution

Using the Oozie Workflow Manager, the newly created job can be submitted for execution.


The workflow actions can be tracked and analyzed, with each action maintaining its own process log file.  The interface also provides a “Rerun” option that allows for unit testing should any action fail to complete.


Final thoughts

In this example, records from Teradata are imported to Hadoop.  SQL-like processing occurs in a series of Oozie Workflow “actions” in the Hadoop platform.  A smaller, aggregated result is then returned to Teradata.

By leveraging Hadoop and its SQL integration points, it is possible to off-load storage and processing demands from a data warehouse to a Big Data store.

Integration between Teradata and the Hortonworks Data Platform (HDP) is not limited to the example in this article.  SQL-H integration for “on-demand” data exchange with Hadoop and Viewpoint/Ambari integration for monitoring and management of Hadoop are both powerful tools that are available within the Teradata suite of products.

This article’s focus was on the integration features within HDP, and these same tools (Oozie, Sqoop and Hive) can interact with almost any RDBMS that supports JDBC.

Additional Notes:

This integration demo was set up between two downloadable VMs:

  1. Hortonworks – HDP 2.0 Sandbox
  2. Teradata Studio Express v14

Data for the Hive tables “contract” and “item” was originally sourced from the Teradata Studio Express “retail” DB, using Sqoop.

sqoop import -libjars ${LIB_JARS} --connect jdbc:teradata://<HOST OR IP ADDRESS>/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --hive-import --username dbc --password dbc --table ITEM --hive-table item

sqoop import -libjars ${LIB_JARS} --connect jdbc:teradata://<HOST OR IP ADDRESS>/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --hive-import --username dbc --password dbc --table CONTRACT --hive-table contract

Target Hive table DDL for the Teradata Retail tables CONTRACT and ITEM:

CREATE TABLE IF NOT EXISTS ITEM (l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int, l_quantity double, l_extendedprice double, l_discount double, l_tax double, l_returnflag string, l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string, l_shipinstruct string, l_shipmode string, l_comment string)
ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t';

CREATE TABLE IF NOT EXISTS CONTRACT (o_orderkey int, o_custkey int, o_orderstatus string, o_totalprice double, o_orderdate string, o_orderpriority string, o_clerk string, o_shippriority int, o_comment string)
ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t';

See documentation from Hortonworks related to the Hortonworks Teradata Connector for more information on Sqoop integration with Teradata.



HBase 0.98.0 is Released


With over 230 JIRA tickets resolved, the Apache HBase community released 0.98.0 yesterday, the next major version after the 0.96.x series.

HBase 0.98.0 comes with an exciting set of new features while keeping the stability improvements and features of 0.96. In addition to the usual bug fixes, some of the major improvements include:

  • Reverse Scans (HBASE-4811): for use cases where both forward and reverse iteration is required, HBase now allows scans to be performed in reverse mode. This is nearly as efficient as forward scanning, with only a small percentage difference in performance. Using it is as simple as:

Scan scan = new Scan();
scan.setReversed(true);

  • Stripe Compaction and other compaction improvements (HBASE-7667): This release brings a lot of the work that went into making compactions and flushes pluggable (with StorageEngine), and also a new compaction policy called “stripe compactions”. Under this policy, each region is automatically sharded into sub-ranges which are compacted individually. You can find more information in the HBaseCon 2013 talk.

  • MapReduce over snapshots (HBASE-8369): This feature leverages HBase snapshots to implement pure client-side scanning from data files in HDFS. Similar to short-circuit HDFS reads, the client can bypass the whole HBase server layer and stream the scan results directly into a Java application or MapReduce job, which brings 5x the scan speed. Learn more about it from this talk.

  • Cell-level ACLs and visibility tags (HBASE-6222): This feature is similar to Apache Accumulo, where access control can be provided and enforced per cell. You can find additional information on how to use this feature in this blog post, and in the security section of the HBase book.

  • Transparent server-side encryption (HBASE-7544): This adds the ability to store HFiles and the write-ahead logs in encrypted format. More information can be found here.

  • Other improvements: 0.98.0 also contains some nice performance improvements, mainly a new WAL threading model (HBASE-8755) and streaming scans from REST (HBASE-9343). A notable correctness fix is truly idempotent increments (HBASE-3787).

0.98.0 is wire compatible with 0.96.x releases, and a rolling upgrade should be sufficient for upgrading from 0.96.x. Upgrading from 0.94.x releases is also supported, but the cluster has to be shut down and migrated. More information on the upgrade can be found in the HBase book.

You can download the new release from here and the full release notes are located here. Upcoming HDP-2.1 will also be based off HBase-0.98.0.

Last, but not least, we would like to thank Andrew Purtell from Intel, who is the release manager of the 0.98 branch, and all the developers from many organizations who have contributed to or tested this release.

Keep HBase’ing.


HBase BlockCache 101


This blog post originally appeared here and is reproduced in its entirety here.

HBase is a distributed database built around the core concepts of an ordered write log and a log-structured merge tree. As with any database, optimized I/O is a critical concern to HBase. When possible, the priority is to not perform any I/O at all. This means that memory utilization and caching structures are of utmost importance. To this end, HBase maintains two cache structures: the “memory store” and the “block cache”. Memory store, implemented as the MemStore, accumulates data edits as they’re received, buffering them in memory (1). The block cache, an implementation of the BlockCache interface, keeps data blocks resident in memory after they’re read.

The MemStore is important for accessing recent edits. Without the MemStore, accessing that data as it was written into the write log would require reading and deserializing entries back out of that file, at least an O(n) operation. Instead, MemStore maintains a skiplist structure, which enjoys an O(log n) access cost and requires no disk I/O. The MemStore contains just a tiny piece of the data stored in HBase, however.

Servicing reads from the BlockCache is the primary mechanism through which HBase is able to serve random reads with millisecond latency. When a data block is read from HDFS, it is cached in the BlockCache. Subsequent reads of neighboring data – data from the same block – do not suffer the I/O penalty of again retrieving that data from disk (2). It is the BlockCache that will be the remaining focus of this post.

Blocks to cache

Before understanding the BlockCache, it helps to understand what exactly an HBase “block” is. In the HBase context, a block is a single unit of I/O. When writing data out to an HFile, the block is the smallest unit of data written. Likewise, a single block is the smallest amount of data HBase can read back out of an HFile. Be careful not to confuse an HBase block with an HDFS block, or with the blocks of the underlying file system – these are all different (3).

HBase blocks come in 4 varieties: DATA, META, INDEX, and BLOOM.

DATA blocks store user data. When the BLOCKSIZE is specified for a column family, it is a hint for this kind of block. Mind you, it’s only a hint. While flushing the MemStore, HBase will do its best to honor this guideline. After each Cell is written, the writer checks if the amount written is >= the target BLOCKSIZE. If so, it’ll close the current block and start the next one (4).
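
As an illustration, the hint is set per column family when the table is created; a minimal HBase shell sketch (the table and family names here are made up):

hbase> create 'mytable', {NAME => 'cf', BLOCKSIZE => '65536'}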

INDEX and BLOOM blocks serve the same goal; both are used to speed up the read path. INDEX blocks provide an index over the Cells contained in the DATA blocks. BLOOM blocks contain a bloom filter over the same data. The index allows the reader to quickly know where a Cell should be stored. The filter tells the reader when a Cell is definitely absent from the data.

Finally, META blocks store information about the HFile itself and other sundry information – metadata, as you might expect. A more comprehensive overview of the HFile formats and the roles of various block types is provided in Apache HBase I/O – HFile.

HBase BlockCache and its implementations

There is a single BlockCache instance in a region server, which means all data from all regions hosted by that server share the same cache pool (5). The BlockCache is instantiated at region server startup and is retained for the entire lifetime of the process. Traditionally, HBase provided only a single BlockCache implementation: the LruBlockCache. The 0.92 release introduced the first alternative in HBASE-4027: the SlabCache. HBase 0.96 introduced another option via HBASE-7404, called the BucketCache.

The key difference between the tried-and-true LruBlockCache and these alternatives is the way they manage memory. Specifically, LruBlockCache is a data structure that resides entirely on the JVM heap, while the other two are able to take advantage of memory from outside of the JVM heap. This is an important distinction because JVM heap memory is managed by the JVM Garbage Collector, while the others are not. In the cases of SlabCache and BucketCache, the idea is to reduce the GC pressure experienced by the region server process by reducing the number of objects retained on the heap.

LruBlockCache

This is the default implementation. Data blocks are cached in JVM heap using this implementation. It is subdivided into three areas: single-access, multi-access, and in-memory. The areas are sized at 25%, 50%, 25% of the total BlockCache size, respectively (6). A block initially read from HDFS is populated in the single-access area. Consecutive accesses promote that block into the multi-access area. The in-memory area is reserved for blocks loaded from column families flagged as IN_MEMORY. Regardless of area, old blocks are evicted to make room for new blocks using a Least-Recently-Used algorithm, hence the “Lru” in “LruBlockCache”.
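
The IN_MEMORY flag mentioned above is likewise just a column family attribute; for example, from the HBase shell (names are illustrative):

hbase> alter 'mytable', {NAME => 'cf', IN_MEMORY => 'true'}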

SlabCache

This implementation allocates areas of memory outside of the JVM heap using DirectByteBuffers. These areas provide the body of this BlockCache. The precise area in which a particular block will be placed is based on the size of the block. By default, two areas are allocated, consuming 80% and 20% of the total configured off-heap cache size, respectively. The former is used to cache blocks that are approximately the target block size (7). The latter holds blocks that are approximately 2x the target block size. A block is placed into the smallest area where it can fit. If the cache encounters a block larger than can fit in either area, that block will not be cached. Like LruBlockCache, block eviction is managed using an LRU algorithm.

BucketCache

This implementation can be configured to operate in one of three different modes: heap, offheap, and file. Regardless of operating mode, the BucketCache manages areas of memory called “buckets” for holding cached blocks. Each bucket is created with a target block size. The heap implementation creates those buckets on the JVM heap; the offheap implementation uses DirectByteBuffers to manage buckets outside of the JVM heap; file mode expects a path to a file on the filesystem wherein the buckets are created. file mode is intended for use with a low-latency backing store – an in-memory filesystem, or perhaps a file sitting on SSD storage (8). Regardless of mode, BucketCache creates 14 buckets of different sizes. It uses frequency of block access to inform utilization, just like LruBlockCache, and has the same single-access, multi-access, and in-memory breakdown of 25%, 50%, 25%. Also like the default cache, block eviction is managed using an LRU algorithm.
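
As a rough sketch, enabling an off-heap BucketCache in this generation of HBase comes down to a few hbase-site.xml properties along these lines (the exact sizing semantics of hbase.bucketcache.size changed between releases, so treat the values as illustrative and check the documentation for your version):

<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value> <!-- or heap, or file:/path/to/cache.data -->
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>4096</value> <!-- cache capacity; a value > 1.0 is read as megabytes, a fraction as a share of heap -->
</property>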

Multi-Level Caching

Both the SlabCache and BucketCache are designed to be used as part of a multi-level caching strategy. Thus, some portion of the total BlockCache size is allotted to an LruBlockCache instance. This instance acts as the first level cache, “L1,” while the other cache instance is treated as the second level cache, “L2.” However, the interaction between LruBlockCache and SlabCache is different from how the LruBlockCache and the BucketCache interact.

The SlabCache strategy, called DoubleBlockCache, is to always cache blocks in both the L1 and L2 caches. The two cache levels operate independently: both are checked when retrieving a block and each evicts blocks without regard for the other. The BucketCache strategy, called CombinedBlockCache, uses the L1 cache exclusively for Bloom and Index blocks. Data blocks are sent directly to the L2 cache. In the event of L1 block eviction, rather than being discarded entirely, that block is demoted to the L2 cache.

Which to choose?

There are two reasons to consider enabling one of the alternative BlockCache implementations. The first is simply the amount of RAM you can dedicate to the region server. Community wisdom recognizes the upper limit of the JVM heap, as far as the region server is concerned, to be somewhere between 14GB and 31GB (9). The precise limit usually depends on a combination of hardware profile, cluster configuration, the shape of data tables, and application access patterns. You’ll know you’ve entered the danger zone when GC pauses and RegionTooBusyExceptions start flooding your logs.

The other time to consider an alternative cache is when response latency really matters. Keeping the heap down around 8-12GB allows the CMS collector to run very smoothly (10), which has measurable impact on the 99th percentile of response times. Given this restriction, the only choices are to explore an alternative garbage collector or take one of these off-heap implementations for a spin.

This second option is exactly what I’ve done. In my next post, I’ll share some unscientific-but-informative experiment results where I compare the response times for different BlockCache implementations.

As always, stay tuned and keep on with the HBase!


1: The MemStore accumulates data edits as they’re received, buffering them in memory. This serves two purposes: it increases the total amount of data written to disk in a single operation, and it retains those recent changes in memory for subsequent access in the form of low-latency reads. The former is important as it keeps HBase write chunks roughly in sync with HDFS block sizes, aligning HBase access patterns with underlying HDFS storage. The latter is self-explanatory, facilitating read requests to recently written data. It’s worth pointing out that this structure is not involved in data durability. Edits are also written to the ordered write log, the HLog, which involves an HDFS append operation at a configurable interval, usually immediate.

2: Re-reading data from the local file system is the best-case scenario. HDFS is a distributed file system, after all, so the worst case requires reading that block over the network. HBase does its best to maintain data locality. These two articles provide an in-depth look at what data locality means for HBase and how it’s managed.

3: File system, HDFS, and HBase blocks are all different but related. The modern I/O subsystem is many layers of abstraction on top of abstraction. Core to that abstraction is the concept of a single unit of data, referred to as a “block”. Hence, all three of these storage layers define their own block, each of their own size. In general, a larger block size means increased sequential access throughput. A smaller block size facilitates faster random access.

4: Placing the BLOCKSIZE check after data is written has two ramifications. A single Cell is the smallest unit of data written to a DATA block. It also means a Cell cannot span multiple blocks.

5: This is different from the MemStore, for which there is a separate instance for every region hosted by the region server.

6: Until very recently, these memory partitions were statically defined; there was no way to override the 25/50/25 split. A given segment, the multi-access area for instance, could grow larger than its 50% allotment as long as the other areas were under-utilized. Increased utilization in the other areas will evict entries from the multi-access area until the 25/50/25 balance is attained. The operator could not change these default sizes. HBASE-10263, shipping in HBase 0.98.0, introduces configuration parameters for these sizes. The flexible behavior is retained.

7: The “approximately” business is to allow some wiggle room in block sizes. HBase block size is a rough target or hint, not a strictly enforced constraint. The exact size of any particular data block will depend on the target block size and the size of the Cell values contained therein. The block size hint defaults to 64kb.

8: Using the BucketCache in file mode with a persistent backing store has another benefit: persistence. On startup, it will look for existing data in the cache and verify its validity.

9: As I understand it, there are two components advising the upper bound on this range. First is a limit on JVM object addressability. The JVM is able to reference an object on the heap with a 32-bit relative address instead of the full 64-bit native address. This optimization is only possible if the total heap size is less than 32GB. See Compressed Oops for more details. The second is the ability of the garbage collector to keep up with the amount of object churn in the system. From what I can tell, the three sources of object churn are the MemStore, the BlockCache, and network operations. The first is mitigated by the MemSlab feature, enabled by default. The second is influenced by the size of your dataset vs. the size of the cache. The third cannot be helped so long as HBase makes use of a network stack that relies on data copy.

10: Just like with 8, this is assuming “modern hardware”. The interactions here are quite complex and well beyond the scope of a single blog post.


Apache Hadoop 2.3.0 Released!


It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.3.0!

hadoop-2.3.0 is the first release for the year 2014, and brings a number of enhancements to the core platform, in particular to HDFS.

With this release, there are two significant enhancements to HDFS:

  • Support for Heterogeneous Storage Hierarchy in HDFS (HDFS-2832)
  • In-memory Cache for data resident in HDFS via Datanodes (HDFS-4949)

With support for heterogeneous storage classes in HDFS, we can now take advantage of different storage types on the same Hadoop clusters. Hence, we can now make better cost/benefit tradeoffs with different storage media such as commodity disks, enterprise-grade disks, SSDs, memory, etc. More details on this major enhancement are available here.

Along similar lines, it is now possible to use memory available in the Hadoop cluster to centrally cache and administer data-sets in-memory in the Datanode’s address space. Applications such as MapReduce, Hive, Pig etc. can now request that data be cached in memory (for the curious, we use a combination of mmap and mlock to achieve this) and then read it directly off the Datanode’s address space for extremely efficient scans by avoiding disk altogether. As an example, Hive is taking advantage of this feature by implementing an extremely efficient zero-copy read path for ORC files – see HIVE-6347 for details.
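
At the command line, this centralized cache is managed with the hdfs cacheadmin tool; a rough sketch of pinning a warehouse directory into memory (the pool and path names are made up):

hdfs cacheadmin -addPool hot-data
hdfs cacheadmin -addDirective -path /apps/hive/warehouse/dim_tables -pool hot-data
hdfs cacheadmin -listDirectives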

In YARN, we are very excited to see that ResourceManager Automatic Failover (YARN-149) is nearly complete, even if it isn’t ready for primetime yet. We expect it to land by the next release, i.e. hadoop-2.4. Furthermore, a number of key operational enhancements have been driven into YARN such as better logging, error-handling, diagnostics etc.

On the MapReduce side of the house, a key enhancement is MAPREDUCE-4421; with this we now no longer need to install MapReduce binaries on every machine and can just use a MapReduce tarball via the YARN DistributedCache by copying it into HDFS.

Of course, a number of bug-fixes, enhancements etc. have also made it into hadoop-2.3; thereby continuing to improve the core platform. Please see hadoop-2.3.0 release notes for more details.

Looking Ahead to Apache Hadoop 2.4.0

With hadoop-2.3 the community has again delivered a major upgrade to the platform. Looking ahead, a number of exciting features are shaping up for Apache Hadoop 2.4, such as:

  • Support for ACLs in HDFS (HDFS-4685)
  • Key operability features such as support for Rolling Upgrades in HDFS (HDFS-5535) and FSImage being enhanced to use ProtoBufs (HDFS-5698).
  • YARN ResourceManager Automatic Failover (YARN-149)
  • YARN Generic Application Timeline (YARN-1530) & History (YARN-321) services to make it significantly easier to develop and manage new frameworks and services in YARN.

Acknowledgements

Many thanks to everyone who contributed to the release, and everyone in the Apache Hadoop community. Just for the reader’s edification it is instructive to note that hadoop-2.3.0 has 560 JIRAs fixed. Of these, 138 are in Hadoop Common, 203 made it to HDFS, 148 are in YARN and 71 went into MapReduce. So, thank you to every single one of the contributors, reviewers and testers!

In particular I’d like to call out the following folks: Arpit Agarwal and Tsz Wo Sze for their work on Heterogeneous Storage; Andrew Wang, Colin McCabe and Chris Nauroth for their efforts on the In-Memory Datanode Cache; Jason Lowe for his work on forklifting MapReduce to deploy via the DistributedCache; and several folks from Twitter such as Gera Shegalov, Lohit V., Joep R. and others for a number of unsung, but very key operational enhancements and fixes to YARN.


How To Configure Elasticsearch on Hadoop with HDP


Elasticsearch’s engine integrates with Hortonworks Data Platform 2.0 and YARN to provide real-time search and access to information in Hadoop.

See it in action:  register for the Hortonworks and Elasticsearch webinar on March 5th 2014 at 10 am PST/1pm EST to see the demo and an outline for best practices when integrating Elasticsearch and HDP 2.0 to extract maximum insights from your data.  Click here to register for this exciting and informative webinar!

Try it yourself: Get started with this tutorial using Elasticsearch and Hortonworks Data Platform, or Hortonworks Sandbox to access server logs in Kibana using Apache Flume for ingestion.

Architecture

The following diagram depicts the proposed architecture to index the logs in near real-time into Elasticsearch and also save them to Hadoop for long-term batch analytics.


Components

Elasticsearch

Elasticsearch is a search engine that can index new documents in near real-time and make them immediately available for querying. Elasticsearch is based on Apache Lucene and allows for setting up clusters of nodes that store any number of indices in a distributed, fault-tolerant way. If a node disappears, the cluster will rebalance the (shards of) indices over the remaining nodes. You can configure how many shards make up each index and how many replicas of these shards there should be. If a master shard goes offline, one of the replicas is promoted to master and used to repopulate another node.

Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into different storage destinations like Hadoop Distributed File System. It has a simple and flexible architecture based on streaming data flows; and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.

Kibana

Kibana is an open source (Apache Licensed), browser-based analytics and search interface to Logstash and other timestamped data sets stored in Elasticsearch. Kibana strives to be easy to get started with, while also being flexible and powerful.

System Requirements

  • Hadoop: Hortonworks Data Platform 2.0 (HDP 2.0) or HDP Sandbox for HDP 2.0
  • OS: 64 bit RHEL (Red Hat Enterprise Linux) 6, CentOS, Oracle Linux 6
  • Software:  yum, rpm, unzip, tar, wget, java
  • JDK: Oracle JDK 1.7 (64-bit), Oracle JDK 1.6 update 31, or OpenJDK 7

Java Installation

Note: Define the JAVA_HOME environment variable and add the Java Virtual Machine and the Java binaries to your PATH environment variable.

Execute the following commands to verify that Java is in the PATH:

export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH
java -version

Flume Installation

Execute the following command to install the Flume binaries and agent scripts:

yum install flume-agent flume

Elasticsearch Installation

The latest Elasticsearch can be downloaded from the following URL: http://www.elasticsearch.org/download/

RPM downloads can be found at https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.noarch.rpm

To install Elasticsearch on data nodes:
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.noarch.rpm

rpm -ivh elasticsearch-0.90.7.noarch.rpm

Setup and configure Elasticsearch

Update the following properties in  /etc/elasticsearch/elasticsearch.yml

  • Set the cluster name: cluster.name: "logsearch"
  • Set node name node.name: "node1"
  • By default every node is eligible to be master and stores data. Properties can be adjusted by
    • node.master: true
    • node.data: true
  • The number of shards can be adjusted with the following property: index.number_of_shards: 5
  • The number of replicas (additional copies) can be set with index.number_of_replicas: 1
  • Adjust the path of data with path.data: /data1,/data2,/data3,/data4
  • Set discovery.zen.minimum_master_nodes: 1 to ensure a node sees that many other master-eligible nodes before it considers the cluster formed. This property needs to be set based on the size of the cluster.
  • Set discovery.zen.ping.timeout: 3s, the time to wait for ping responses from other nodes during discovery. The value needs to be higher for slow or congested networks.
  • Disable multicast discovery only if multicast is not supported in the network: discovery.zen.ping.multicast.enabled: false

Note: If multicast is disabled, configure an initial list of master nodes in the cluster: discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]
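
Pulling the settings above together, the relevant portion of /etc/elasticsearch/elasticsearch.yml for this walkthrough would look roughly like the following (host names are placeholders, and the unicast list is only needed when multicast is disabled):

cluster.name: "logsearch"
node.name: "node1"
node.master: true
node.data: true
index.number_of_shards: 5
index.number_of_replicas: 1
path.data: /data1,/data2,/data3,/data4
discovery.zen.minimum_master_nodes: 1
discovery.zen.ping.timeout: 3s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]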

Logging properties can be adjusted in /etc/elasticsearch/logging.yml. The default log location is: /var/log/elasticsearch

Starting and Stopping Elasticsearch

  • To start Elasticsearch: /etc/init.d/elasticsearch start
  • To stop Elasticsearch: /etc/init.d/elasticsearch stop

Kibana Installation

Download the Kibana binaries from the following URL https://download.elasticsearch.org/kibana/kibana/kibana-3.0.0milestone4.tar.gz

wget https://download.elasticsearch.org/kibana/kibana/kibana-3.0.0milestone4.tar.gz

Extract archive with tar -zxvf kibana-3.0.0milestone4.tar.gz

Setup and configure Kibana

  • Open config.js file under the extracted directory
  • Set the elasticsearch parameter to the fully qualified hostname or IP of your Elasticsearch server:
  • elasticsearch: "http://<YourIP>:9200"
  • Open index.html in your browser to access the Kibana UI
  • Update the logstash index pattern to the Flume-supported index pattern:
  • Edit app/dashboards/logstash.json and replace all occurrences of [logstash-]YYYY.MM.DD with [logstash-]YYYY-MM-DD (see the one-liner below)
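
This replacement can be done in one step with sed from the Kibana directory (assuming GNU sed for the -i flag):

sed -i 's/\[logstash-\]YYYY\.MM\.DD/[logstash-]YYYY-MM-DD/g' app/dashboards/logstash.json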

Setup and configure Flume

For demonstration purposes, let's set up and configure a Flume agent, with the following Flume configuration, on a host where a log file needs to be consumed.

Create plugins.d directory and copy the Elasticsearch dependencies:

mkdir /usr/lib/flume/plugins.d
cp $elasticsearch_home/lib/elasticsearch-0.90*jar /usr/lib/flume/plugins.d
cp $elasticsearch_home/lib/lucene-core-*jar /usr/lib/flume/plugins.d

Update the Flume configuration to consume a local file and index into Elasticsearch in logstash format. Note: in real-world use cases, the Flume Log4j Appender, Syslog TCP Source, Flume Client SDK, or Spool Directory Source are preferred over tailing logs.

agent.sources = tail
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.sources.tail.channels = memoryChannel
agent.sources.tail.type = exec
agent.sources.tail.command = tail -F /tmp/es_log.log
agent.sources.tail.interceptors=i1 i2 i3
agent.sources.tail.interceptors.i1.type=regex_extractor
agent.sources.tail.interceptors.i1.regex = (\\w.*):(\\w.*):(\\w.*)\\s
agent.sources.tail.interceptors.i1.serializers = s1 s2 s3
agent.sources.tail.interceptors.i1.serializers.s1.name = source
agent.sources.tail.interceptors.i1.serializers.s2.name = type
agent.sources.tail.interceptors.i1.serializers.s3.name = src_path
agent.sources.tail.interceptors.i2.type=org.apache.flume.interceptor.TimestampInterceptor$Builder
agent.sources.tail.interceptors.i3.type=org.apache.flume.interceptor.HostInterceptor$Builder
agent.sources.tail.interceptors.i3.hostHeader = host
agent.sinks = elasticsearch
agent.sinks.elasticsearch.channel = memoryChannel
agent.sinks.elasticsearch.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.elasticsearch.batchSize=100
agent.sinks.elasticsearch.hostNames = 172.16.55.129:9300
agent.sinks.elasticsearch.indexName = logstash
agent.sinks.elasticsearch.clusterName = logsearch
agent.sinks.elasticsearch.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer

Prepare sample data for a simple test

Create a file /tmp/es_log.log with the following data

website:weblog:login_page weblog data1
website:weblog:profile_page weblog data2
website:weblog:transaction_page weblog data3
website:weblog:docs_page weblog data4
syslog:syslog:sysloggroup syslog data1
syslog:syslog:sysloggroup syslog data2
syslog:syslog:sysloggroup syslog data3
syslog:syslog:sysloggroup syslog data4

Restart Flume

/etc/init.d/flume-agent restart

Searching and Dashboarding with Kibana

Open $KIBANA_HOME/index.html in a browser. By default the welcome page is shown.

Click on “Logstash Dashboard” and select the appropriate time range to look at the charts based on the timestamped fields.


These screenshots show the various charts available on search fields, e.g. pie, bar, and table charts.


Content can be searched with custom filters and graphs can be plotted based on the search results as shown below.


Batch Indexing using MapReduce/Hive/Pig

Elasticsearch’s real-time search and analytics are natively integrated with Hadoop and support MapReduce, Cascading, Hive and Pig.

Component | Implementation | Notes
MR2/YARN | ESInputFormat / ESOutputFormat | MapReduce input and output formats are provided by the library
Hive | org.elasticsearch.hadoop.hive.ESStorageHandler | Hive SerDe implementation
Pig | org.elasticsearch.hadoop.pig.ESStorage | Pig storage handler

Detailed documentation with examples related to Elasticsearch-Hadoop integration can be found at the following URL: https://github.com/elasticsearch/elasticsearch-hadoop

Thoughts on Best Practices

  1. Set discovery.zen.minimum_master_nodes to N/2 + 1, where N is the number of master-eligible nodes, to avoid split brain
  2. Set action.disable_delete_all_indices to disable accidental deletes
  3. Set gateway.recover_after_nodes to the number of nodes that need to be up before the recovery process starts replicating data around the cluster
  4. Relax the real-time aspect from 1 second to something a bit higher (index.engine.robin.refresh_interval)
  5. Increase the memory allocated to the Elasticsearch node; by default it is 1g
  6. Use Java 7 if possible for better performance with Elasticsearch
  7. Set index.fielddata.cache: soft to avoid OutOfMemory errors
  8. Use higher batch sizes in the Flume sink for higher throughput, e.g. 1000
  9. Increase the open file limits for Elasticsearch


Innovations and Contributions: Apache Hadoop


The Apache Software Foundation (ASF) provides valuable stewardship and guide-rails for projects interested in attracting the broadest possible community of involvement, especially across a wide range of vendors and end users. While the ASF’s role is not about guaranteeing wild success for every project, they do a great job of providing a place where the broadest community of people, ideas, and code can come together and raise an elephant, so to speak.

These sentiments were expressed quite nicely in an article by Andy Oliver:

Hadoop is everything an Apache project should be: a community of rival companies, an increasing activity level, and an increasing number of committers…This is Apache at its very finest. It will be messy and there will be kerfuffles, but how else and where else could this happen? Where else could Hadoop be both open source and inaugurate the next stage of the InterWebs? In some ways Hadoop is in fact the successor to the Apache Web Server — or maybe the realization of what it started.

The Absolute Transparency Of It All

One of the remarkable aspects of open source development under the stewardship of the ASF is the absolute transparency of it all.  Want to know how many lines of code have been contributed to a project? Or which committers are contributing to a particular project?  It’s all there…in the open.

With that in mind, a recent blog post caught our eye from a Hadoop contributor from Japan (thank you @ajis_ka) who wrote a simple query and produced a handful of graphs identifying some of the interesting trends regarding development within the Apache Hadoop project.

While not surprising, the post drew a couple of key conclusions:

  1. The pace of contributions to Apache Hadoop from 2011 through 2013 remains healthy and vibrant.
  2. While a diverse group is contributing, Hortonworks engineers continue to deliver a significant share (which is not a surprise since we are maniacally focused on driving innovation in the open).

Let’s Take a Look at the Numbers

Figure 2 below illustrates how the YARN-based architecture of Hadoop 2 worked its way into the project over the course of 2011 through 2013. YARN underpins Hadoop 2’s fundamental new architecture for supporting workloads that span batch, interactive, and real-time use cases. While the amount of changes to the Hadoop codebase stabilized in 2013, figure 3 shows that significant net-new features were added and completed in 2013 (such as HDFS Snapshots and High Availability), which explains the increase of 260,000+ lines of code versus prior years and demonstrates the ongoing innovation happening within the project.



What’s even more interesting than the absolute lines of code contributed is the increase in diversity of organizations contributing to Apache Hadoop between 2012 and 2013 as illustrated in the next two charts from the blog post.



The diversification of contributions between 2012 and 2013 is illustrative of the fact that the framework for community collaboration provided by the Apache Software Foundation is working quite well for the Apache Hadoop project, as Andy Oliver’s article mentioned above implied.

We are obviously proud of our contributions to the Apache Hadoop project and we plan to continue to catalyze and energize this amazing community with your help.

And thanks again to @ajis_ka for sharing the analysis.


Apache Tez 0.3 Released!


The Apache Tez community has voted to release 0.3 of the software.

Apache™ Tez is a replacement of MapReduce that provides a powerful framework for executing a complex topology of tasks. Tez 0.3.0 is an important release towards making the software ready for wider adoption by focusing on fundamentals and ironing out several key functions. The major action areas in this release were:

  1. Security. Apache Tez now works on secure Hadoop 2.x clusters using the built-in security mechanisms of the Hadoop ecosystem.
  2. Scalability. We tested the software on large clusters, very large data sets and large applications processing tens of TB each to make sure it scales well with both data-sets and machines.
  3. Fault Tolerance. Apache Tez executes a complex DAG workflow that can be subject to multiple failure conditions in clusters of commodity hardware and is highly resilient to these and other sorts of failures.
  4. Stability. A large number of bug fixes went into this release as early adopters and testers put the software through its paces and reported issues.

To prove the stability and performance of Tez, we executed complex jobs comprised of more than 50 different stages and tens of thousands of tasks on a fairly large cluster (> 300 Nodes, > 30TB data). Tez passed all our tests and we are certain that new adopters can integrate confidently with Tez and enjoy the same benefits as Apache Hive & Apache Pig have already.

There are promising signs of wider adoption of Tez, with the Apache Pig community being in the final testing phase of its initial migration to this new framework. The 43rd Bay Area Hadoop User Group meetup became a Tez evening with Apache Hive and Apache Pig showcasing their current and future plans around Apache Tez. In addition, Concurrent Inc. has plans to port to Tez as an execution engine for the Cascading, Scalding & Cascalog family of APIs. Last but not least, Apache Hive with Tez integration is close to its first official release in Hive 0.13. That’s a great vote of confidence in the readiness of Tez.

Acknowledgements

The rapid progress made by Apache Tez can be attributed to the close cooperation displayed by the Tez, Hive and Pig communities. We would like to call out Vikram Dixit & Gunther Hagleitner from Hive, Rohini Palaniswamy, Daniel Dai & Cheolsoo Park from Pig, Gopal Vijayaraghavan – all-round performance ninja, Rajesh Balamohan – Hive performance guru, Ramya Sunil & Tassapol Athiapinya – Hortonworks QA, for their relentless scrutiny, valuable suggestions and timeless patience.

– Apache Tez team


HBase BlockCache Showdown


This blog post originally appeared here and is reproduced in its entirety here. Part 1 can be found here.

The HBase BlockCache is an important structure for enabling low latency reads. As of HBase 0.96.0, there are no fewer than three different BlockCache implementations to choose from. But how do you know when to use one over the other? There’s a little bit of guidance floating around out there, but nothing concrete. It’s high time the HBase community changed that! I did some benchmarking of these implementations, and I’d like to share the results with you here.

Note that this is my second post on the BlockCache. In my previous post, I provide an overview of the BlockCache in general as well as brief details about each of the implementations. I’ll assume you’ve read that one already.

The goal of this exercise is to directly compare the performance of different BlockCache implementations. The metric of concern is that of client-perceived response latency. Specifically, the concern is for the response latency at the 99th percentile of queries – that is, the worst case experience that the vast majority of users will ever experience. With this in mind, two different variables are adjusted for each test: RAM allotted and database size.

The first variable is the amount of RAM made available to the RegionServer process and is expressed in gigabytes. The BlockCache is sized as a portion of the total RAM allotted to the RegionServer process. For these tests, this is fixed at 60% of total RAM. The second variable is the size of the database over which the BlockCache is operating. This variable is also expressed in gigabytes, but in order to directly compare results generated as the first variable changes, this is also expressed as the ratio of database size to RAM allotted. Thus, this ratio is a rough description of the amount of “cache churn” the RegionServer will experience, regardless of the magnitude of the values. With a smaller ratio, the BlockCache spends less time juggling blocks and more time servicing reads.

Test Configurations

The tests were run across two single machine deployments. Both machines are identical, with 64G total RAM and 2x Xeon E5-2630@2.30GHz, for a total of 24 logical cores each. The machines both had 6x1T disks sharing HDFS burden, spinning at 7200 RPM. Hadoop and HBase were deployed using Apache Ambari from HDP-2.0. Each of these clusters-of-one were configured to be fully “distributed,” meaning that all Hadoop components were deployed as separate processes. The test client was also run on the machine under test, so as to omit any networking overhead from the results. The RegionServer JVM, Hotspot 64-bit Server v1.6.0_31, was configured to use the CMS collector.

Configurations are built assuming a random-read workload, so MemStore capacity is sacrificed in favor of additional space for the BlockCache. The default LruBlockCache is considered the baseline, so that cache is configured first and its memory allocations are used as guidelines for the other configurations. The goal is for each configuration to allow roughly the same amount of memory for the BlockCache, the MemStores, and other activities of the HBase process itself.

It’s worth noting that the LruBlockCache configuration includes checks to ensure that JVM heap within the process is not oversubscribed. These checks enforce a limitation that only 80% of the heap may be assigned between the MemStore and BlockCache, leaving the remaining 20% for other HBase process needs. As the amount of RAM consumed by these configurations increases, this 20% is likely overkill. A production configuration using so much heap would likely want to override this over-subscription limitation. Unfortunately, this limit is not currently exposed as a configuration parameter. For large memory configurations that make use of off-heap memory management techniques, this limitation is likely not encountered.

Four different memory allotments were exercised: 8G (considered “conservative” heapsize), 20G (considered “safe” heapsize), 50G (complete memory subscription on this machine), and 60G (memory over-subscription for this machine). This is the total amount of memory made available to the RegionServer process. Within that process, memory is divided between the different subsystems, primarily the BlockCache and MemStore. Because some of the BlockCache implementations operate on RAM outside of the JVM garbage collector’s control, the size of the JVM heap is also explicitly mentioned. The total memory divisions for each configuration are as follows. Values are taken from the logs; CacheConfig initialization in the case of BlockCache implementations and MemStoreFlusher for the max heap and global MemStore limit.


Configuration | Total Memory | Max Heap | BlockCache Breakdown | Global MemStore Limit
LruBlockCache | 8G | 7.8G | 4.7G lru | 1.6G
SlabCache | 8G | 1.5G | 4.74G slabs + 19.8m lru | 1.9G
BucketCache, heap | 8G | 7.8G | 4.63G buckets + 47.9M lru | 1.6G
BucketCache, offheap | 8G | 1.9G | 4.64G buckets + 48M lru | 1.5G
BucketCache, tmpfs | 8G | 1.9G | 4.64G buckets + 48M lru | 1.5G
LruBlockCache | 20G | 19.4G | 11.7G lru | 3.9G
SlabCache | 20G | 4.8G | 11.8G slabs + 48.9M lru | 3.8G
BucketCache, heap | 20G | 19.4G | 11.54G buckets + 119.5M lru | 3.9G
BucketCache, offheap | 20G | 4.9G | 11.60G buckets + 120.0M lru | 3.8G
BucketCache, tmpfs | 20G | 4.8G | 11.60G buckets + 120.0M lru | 3.8G
LruBlockCache | 50G | 48.8G | 29.3G lru | 9.8G
SlabCache | 50G | 12.2G | 29.35G slabs + 124.8M lru | 9.6G
BucketCache, heap | 50G | 48.8G | 30.0G buckets + 300M lru | 9.8G
BucketCache, offheap | 50G | 12.2G | 29.0G buckets + 300.0M lru | 9.6G
BucketCache, tmpfs | 50G | 12.2G | 29.0G buckets + 300.0M lru | 9.6G
LruBlockCache | 60G | 58.6G | 35.1G lru | 11.7G
SlabCache | 60G | 14.6G | 35.2G slabs + 149.8M lru | 11.6G
BucketCache, heap | 60G | 58.6G | 34.79G buckets + 359.9M lru | 11.7G
BucketCache, offheap | 60G | 14.6G | 34.80G buckets + 360M lru | 11.6G
BucketCache, tmpfs | 60G | 14.6G | 34.80G buckets + 360.0M lru | 11.6G

These configurations are included in the test harness repository.

Test Execution

HBase ships with a tool called PerformanceEvaluation, which is designed for generating a specific workload against an HBase cluster. These tests were conducted using the randomRead workload, executed in multi-threaded mode (as opposed to mapreduce mode). As of HBASE-10007, this tool can produce latency information for individual read requests. YCSB was considered as an alternative load generator, but PerformanceEvaluation was preferred because it is provided out-of-the-box by HBase. Thus, hopefully these results are easily reproduced by other HBase users.

The schema of the test dataset is as follows. It is comprised of a single table with a single column family, called “info”. Each row contains a single column, called “data”. The rowkey is 26 bytes long; the column value is 1000 bytes. Thus, a single row is a total of 1032 bytes, or just over 1K, of user data. Cell tags were not enabled for these tests.

The test was run three times for each configuration: database size to RAM allotted ratios of 0.5:1, 1.5:1, and 4.5:1. Because the BlockCache consumes roughly 60% of available RegionServer RAM, these ratios translated roughly into database size to BlockCache size ratios of 1:1, 3:1, 9:1. That is, roughly, not exactly, and in the BlockCache’s favor. Thus, with the first configuration, the BlockCache shouldn’t need to ever evict a block while in the last configuration, the BlockCache will be evicting stale blocks frequently. Again, the goal here is to evaluate the performance of a BlockCache as it experiences varying working conditions.

For all tests, the number of clients was fixed at 5, far below the number of available RPC handlers. This is consistent with the desire to benchmark individual read latency with minimal overhead from context switching between tasks. A future test can examine the overall latency (and, hopefully, throughput) impact of varying numbers of concurrent clients. My intention with HBASE-9953 is to simplify managing this kind of test.

Before each test, the database was created and populated using the sequentialWrite command, also provided by PerformanceEvaluation. Once the table was populated, the RegionServer was restarted with the desired configuration and the BlockCache was warmed. Warming was performed in one of two ways, depending on the ratio of database size to allotted RAM. For the 0.5:1 ratio, the entire table was scanned with a scanner configured with cacheBlocks=true, using a modified version of the HBase shell’s count command. For the other ratios, the randomRead command was used with a sampling rate of 0.1.

With the cache initially warmed, the test was run. Again, randomRead was used to execute the test. This time the sampling rate was set to 0.01 and the latency flag was enabled so that response times would be collected. This test was run 3 times, with the last run being collected for the final result. This was repeated for each permutation of configuration and database:RAM ratio.
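For those looking to reproduce this sequence, a rough sketch of the PerformanceEvaluation invocations follows. The class name, the sequentialWrite and randomRead commands, and --nomapred are standard; the exact --sampleRate and --latency flag names are assumptions based on the description in this post and may vary by HBase version.

# Populate the table with 5 write clients (multi-threaded, not mapreduce mode)
hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred sequentialWrite 5

# Warm the BlockCache by sampling roughly 10% of the keyspace
hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --sampleRate=0.1 randomRead 5

# Measured run: sample roughly 1% of the keyspace and record per-request latency
hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --sampleRate=0.01 --latency randomRead 5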

HBase served no other workload while under test – there were no concurrent scan or write operations being served. This is likely not the case with real-world application deployments. The previous table was dropped before creating the next, so that the only user table on the system was the table under test.

The test harness used to run these tests is crude, but it’s available for inspection. It also includes patch files for all configurations, so direct reproduction is possible.

Initial Results

The first view of the data compares the behavior of the implementations at a given memory footprint. It shows how an implementation performs as the database grows relative to a fixed amount of memory under management. With the smallest memory footprint and smallest database size, the 99% response latency is fairly consistent across all implementations. As the database size grows, the heap-based implementations begin to separate from the pack, suffering consistently higher latency. This turns out to be a consistent trend regardless of the amount of memory under management.

Also notice that the LruBlockCache is holding its own alongside the off-heap implementations with the 20G RAM hosting both the 30G and 90G data sets. It falls away in the 50G RAM test though, which indicates to me that the effective maximum size for this implementation is somewhere between these two values. This is consistent with the “community wisdom” referenced in the previous post.

The second view is a pivot on the same data that looks at how a single implementation performs as the overall amount of data scales up. In this view, the ratio of memory footprint to database size is held fixed while the absolute values are increased. This is interesting as it suggests how an implementation might perform as it “scales up” to match increasing memory sizes provided by newer hardware.

From this view it’s much easier to see that both the LruBlockCache and BucketCache implementations suffer no performance degradation with increasing memory sizes – so long as the dataset fits in RAM. This result for the LruBlockCache surprised me a little. It indicates to me that under the conditions of this test, the on-heap cache is able to reach a kind of steady-state with the Garbage Collector.

The other surprise illustrated by this view is that the SlabCache imposes some overhead on managing increasingly larger amounts of memory. This overhead is present even when the dataset fits into RAM. In this, it is inferior to the BucketCache.

From this view’s perspective, I believe the BucketCache running in tmpfs mode is the most efficient implementation for larger memory footprints, with offheap mode coming in a close second. Operationally, the offheap mode is simpler to configure, as it requires only changes to HBase configuration files. This view also suggests that the cache is of decreasing benefit with larger datasets (though that should be intuitively obvious).

Based on this data, I would recommend considering an off-heap BucketCache solution when your database size is far larger than the amount of available cache, especially when the absolute size of available memory is larger than 20G. This data can also inform purchasing decisions about the amount of memory required to host a dataset of a given size. Finally, I would consider deprecating the SlabCache implementation in favor of the BucketCache.

Here are the raw results, which include additional latency measurements at the 95th and 99.9th percentiles.

Follow-up test

With the individual implementation baselines established, it’s time to examine how a “real world” configuration holds up. The BucketCache and SlabCache are designed to be run as a multi-level configuration, after all. For this, I chose to examine only the 50G memory footprint. The total 50G was divided between onheap and offheap memory. Two additional configurations were created for each implementation: 10G onheap + 40G offheap and 20G onheap + 30G offheap. These are compared to running with the full 50G heap and LruBlockCache.

This result is the most impressive of all. When properly configured in an L1+L2 deployment, the BucketCache is able to maintain sub-millisecond response latency even with the largest database tested. This configuration significantly outperforms both the single-level LruBlockCache and the (effectively) single-level BucketCache. There is no apparent difference between the 10G and 20G heap configurations, which leads me to believe that, for this test, the non-data blocks fit comfortably in the LruBlockCache even with the 10G heap.

Again, the raw results with additional latency measurements.

Conclusions

When a dataset fits into memory, the lowest latency results are experienced using the LruBlockCache. This result is consistent regardless of how large the heap is, which is perhaps surprising when compared to “common wisdom.” However, when using a larger heap size, even a slight amount of BlockCache churn will cause the LruBlockCache latency to spike. Presumably this is due to the increased GC activity required to manage a large heap. This indicates to me that this test establishes a kind of false GC equilibrium enjoyed by this implementation. Further testing would intermix other activities into the workload and observe the impact.

When a dataset is just a couple of times larger than the available BlockCache and the region server has a large amount of RAM to dedicate to caching, it’s time to start exploring other options. The BucketCache using the file configuration running against a tmpfs mount holds up well to an increasing amount of RAM, as does the offheap configuration. Despite having slightly higher latency than the BucketCache implementations, the SlabCache holds its own. Worryingly, though, as the amount of memory under management increases, its trend lines show a gradual rise in latency.

Be careful not to oversubscribe the RAM in systems running any of these configurations, as this causes latency to spike dramatically. This is most clear in the heap-based BlockCache implementations. Oversubscribing the memory on a system serving far more data than it has available cache results in the worst possible performance.

I hope this experiment proves useful to the wider community, that these results can be reproduced without difficulty, and that others can pick up where I’ve left off. Though these results are promising, a more thorough study is warranted. Perhaps someone out there with even larger memory machines can extend the X-axis of these graphs into the multiple hundreds of gigabytes.

The post HBase BlockCache Showdown appeared first on Hortonworks.


Defining Enterprise Hadoop


Due to the flourish of Apache Software Foundation projects that have emerged in recent years in and around the Apache Hadoop project, a common question I get from mainstream enterprises is: What is the definition of Hadoop?

Download our Whitepaper: Hadoop and a Modern Data Architecture.

This question goes beyond the Apache Hadoop project itself, since most folks know that it’s an open source technology borne out of the experience of web scale consumer companies such as Yahoo!, Facebook and others who were confronted with the need to store and process massive quantities of data. The question is more about making sense of the wide range of innovative projects that help make Hadoop more relevant and useful for mainstream enterprises.

Before answering, I usually reframe the question as: What is Enterprise Hadoop?

A Blueprint for Enterprise Hadoop

To fully frame out the “Enterprise Hadoop” context, I draw a diagram with 8 key areas worth talking about. The 3 gray boxes set the broader context within the enterprise, while the 5 green boxes outline the core capabilities required of Enterprise Hadoop.

[Diagram: the Enterprise Hadoop blueprint. Gray boxes: Presentation & Applications (enable existing and new applications to provide value to the organization); Enterprise Management & Security (empower existing operations and security tools to manage Hadoop); Deployment Choice (Linux & Windows; on-premise or cloud/hosted). Green boxes: Governance & Integration (data workflow, lifecycle & governance); Data Access (batch, script, SQL, NoSQL, stream, in-memory, search); Data Management (HDFS distributed file system); Security (authentication, authorization, accounting & data protection); Operations (provision, manage & monitor; scheduling).]

It all starts with the Presentation & Application box that’s about the new and existing applications that will leverage and derive value from data stored in Hadoop. In order to maximize Enterprise Hadoop’s impact, it also needs to support a wide range of Deployment Options spanning physical, virtual, and cloud. Since we’re talking about real applications and data important to the business, the Enterprise Hadoop platform needs to integrate with existing Enterprise Management & Security tools and processes.

That leaves us with the following 5 areas of core Enterprise Hadoop capabilities:

  1. Data Management is about storing and processing vast quantities of data in a scale out manner.
  2. Data Access is about enabling you to interact with your data in a wide variety of ways – spanning batch, interactive, and real-time use cases.
  3. Data Governance & Integration is about quickly and easily loading your data and managing it according to policy.
  4. Security addresses the core requirements of Authentication, Authorization, Accounting and Data Protection.
  5. Operations addresses the need to provision, manage, monitor and operate Hadoop clusters at scale.

Working on Enterprise Hadoop within the Community for the Enterprise

As Apache Hadoop has taken its role in enterprise data architectures, a host of open source projects have been contributed to the Apache Software Foundation (ASF) by both vendors and users alike that greatly expand Hadoop’s capabilities as an enterprise data platform. Many of the “committers” for these open source projects are Hortonworks employees. For those unfamiliar with the term “committer”, they are the talented individuals who devote their time and energy to specific Apache projects, adding features, fixing bugs, and reviewing and approving changes submitted by others interested in contributing to the project.

At Hortonworks, we have over 100 committers authoring code and providing stewardship within the Apache Community across a wide range of Hadoop-related projects. Since we are focused on serving the needs of mainstream enterprise users, we have a rigorous engineering process and related test suites that integrate, test, and certify at scale this wide range of projects into an easy to use and consume Enterprise Hadoop platform called the Hortonworks Data Platform. Those talented people in our engineering team provide the foundation for the industry-leading support and services that we deliver directly or through our partners to the market.

Hortonworks Data Platform Delivers Enterprise Hadoop

At Hortonworks, we’ve maintained a consistent focus on enabling Hadoop to be an enterprise-viable data platform that uniquely powers a new generation of data-driven applications and analytics. Let’s take a look at the 5 areas of core Enterprise Hadoop and the Hortonworks Data Platform in more detail.


Data Management: The Hadoop Distributed File System (HDFS) provides the foundation for storing data in any format at scale across low-cost commodity hardware. YARN, introduced in the Apache Hadoop 2 release, is a must-have for Enterprise Hadoop deployments since it acts as the platform’s data operating system – providing the resource management and pluggable architecture for enabling a wide variety of data access methods to operate on data stored in HDFS with predictable performance and service levels.

Data Access: While classic Batch-oriented MapReduce applications are important, thanks to the introduction of YARN, they are not the only workloads that can run natively “IN-Hadoop”. Technologies for Scripting, SQL, NoSQL, Search, and Streaming are integrated into the Hortonworks Data Platform. Apache Pig provides Scripting capabilities, and Apache Hive is the de-facto standard SQL engine for handling BOTH batch and interactive SQL data access and is proven at petabyte scale. Apache HBase is a popular columnar NoSQL database and Apache Accumulo, with its cell-level security, is used in high-security NoSQL use cases. Apache Storm supports real-time stream processing commonly needed for sensor and machine data use cases. And there are other data access engines including Apache Spark for in-memory iterative analytics and a wide range of 3rd-party ISV solutions expected to plug into the platform over 2014 and beyond. Thanks to YARN, all of these data access engines can work across one set of data in a coordinated and predictable manner.

Data Governance & Integration: Apache Falcon provides policy-based workflows for governing the lifecycle and flow of data in Hadoop, including disaster recovery and data retention use cases. For data ingest, Apache Sqoop makes it easy to bring data from other databases into Hadoop, and Apache Flume enables logs to flow easily into Hadoop. NFS and WebHDFS interfaces provide familiar and flexible ways to store and interact with data in HDFS.

Security: Providing a holistic approach to authentication, authorization, accounting, and data protection, security is handled at every layer of the Hadoop stack: from the HDFS storage and YARN resource management layers, to the data access components such as Hive as well as the overall data pipelines coordinated by Falcon, on up through the perimeter of the entire cluster via Apache Knox.

Operations: Apache Ambari offers a comprehensive solution including the necessary user interface and REST APIs for enabling operators to provision, manage and monitor Hadoop clusters as well as integrate with other enterprise management solutions.

As you can see, the Hortonworks Data Platform addresses everything that’s needed from an Enterprise Hadoop solution – all delivered as a 100% open source platform that you can rely on.

To learn more about Enterprise Hadoop and how it powers the Modern Data Architecture including the common journey from new analytic applications to a Data Lake, I encourage you to download our whitepaper here.

The post Defining Enterprise Hadoop appeared first on Hortonworks.

Congratulations to Leslie Lamport on winning the 2013 Turing Award


Hortonworks would like to congratulate Leslie Lamport on winning the 2013 Turing Award given by the Association for Computing Machinery. This award is essentially the equivalent of the Nobel Prize for computer science. Among Lamport’s many and varied contributions to the field of computer science are TLA (the Temporal Logic of Actions), LaTeX, and PAXOS.

The last of these, the PAXOS three-phase consensus protocol, inspires the ZooKeeper coordination service, which in turn powers HBase and highly available HDFS. Lamport described the PAXOS algorithm with creativity and wit in his paper titled “The Part-Time Parliament”, which was submitted in 1990 and published in 1998.

We are delighted to celebrate Lamport’s contribution and the application of his work to the Apache Hadoop family of open source projects. Hadoop clusters process and serve some of the world’s largest datasets in an efficient and fault tolerant manner.

Lamport was born on February 8, 1941 in New York City. He graduated from the Bronx High School of Science and received a B.S. in Mathematics from M.I.T. in 1960, and M.S. and Ph.D. degrees in Mathematics from Brandeis University in 1963 and 1972, respectively. His previous awards include the Dijkstra Prize in Distributed Computing in 2005 and the von Neumann Medal in 2008. He currently works as a Research Scientist at Microsoft Research.

Further Reading:

The post Congratulations to Leslie Lamport on winning the 2013 Turing Award appeared first on Hortonworks.

Hadoop GroupMapping – LDAP Integration


LDAP provides a central source for maintaining users and groups within an enterprise. There are two ways to use LDAP groups within Hadoop. The first is to use OS level configuration to read LDAP groups. The second is to explicitly configure Hadoop to use LDAP-based group mapping.

Here is an overview of steps to configure Hadoop explicitly to use groups stored in LDAP.

  • Create Hadoop service accounts in LDAP
  • Shutdown HDFS NameNode & YARN ResourceManager
  • Modify core-site.xml to point to LDAP for group mapping
  • Re-start HDFS NameNode & YARN ResourceManager
  • Verify LDAP based group mapping

Prerequisites: Access to LDAP and the connection details are available.

Step 1: Create Hadoop service accounts in LDAP

Here is an example services.ldif file which defines the Hadoop service accounts (hcat, mapred, hdfs, yarn, hbase, zookeeper, oozie, hive). It also defines the hadoop group and makes the service accounts members of that group. Add the accounts and groups in the LDIF to your LDAP (a minimal sketch of such entries appears below). Here is an example using the ldapadd command to do just that:

ldapadd -f /vagrant/provision/services.ldif -D cn=manager,dc=hadoop,dc=apache,dc=org -w hadoop

Note: The LDIF path, bind DN (-D) and password (-w) values above are specific to your environment.
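The entries in such a file look roughly like the following. This is an illustrative sketch only: the DNs, organizational units and object classes shown here are assumptions, chosen to match the user and group search filters configured in core-site.xml later in this post.

# Illustrative only: one service account and the hadoop group that contains it
dn: cn=hdfs,ou=users,dc=hadoop,dc=apache,dc=org
objectClass: top
objectClass: applicationProcess
cn: hdfs

dn: cn=hadoop,ou=groups,dc=hadoop,dc=apache,dc=org
objectClass: top
objectClass: groupOfNames
cn: hadoop
member: cn=hdfs,ou=users,dc=hadoop,dc=apache,dc=org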

Step 2: Shutdown Hadoop

See the Hortonworks Data Platform documentation for steps on shutting down HDFS NameNode & YARN ResourceManager.

Step 3: Modify core-site.xml to point to LDAP for group mapping

Back up your core-site.xml before making modifications to it. Below is a sample configuration that needs to be added to core-site.xml. You will need to provide values for the bind user, bind password, and other properties specific to your LDAP, and make sure the object class, user filter, and group filter match the values specified in services.ldif.

<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.bind.user</name>
  <value>cn=Manager,dc=hadoop,dc=apache,dc=org</value>
</property>
<!--
<property>
  <name>hadoop.security.group.mapping.ldap.bind.password.file</name>
  <value>/etc/hadoop/conf/ldap-conn-pass.txt</value>
</property>
-->
<property>
  <name>hadoop.security.group.mapping.ldap.bind.password</name>
  <value>hadoop</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://localhost:389/dc=hadoop,dc=apache,dc=org</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value></value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.search.filter.user</name>
  <value>(&amp;(|(objectclass=person)(objectclass=applicationProcess))(cn={0}))</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.search.filter.group</name>
  <value>(objectclass=groupOfNames)</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.search.attr.member</name>
  <value>member</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.search.attr.group.name</name>
  <value>cn</value>
</property>

While the group mapping configuration supports reading the password from a file, in the example above the relevant configuration is commented out due to a bug (HADOOP-10249).

Step 4 : Re-start Hadoop

Follow the instructions in the Hortonworks Data Platform documentation to re-start HDFS NameNode & YARN ResourceManager.

Step 5: Verify LDAP group mapping

Run the hdfs groups command. It fetches the groups for the current user, now resolved through LDAP. With LDAP group mapping configured, HDFS permissions can leverage groups defined in LDAP for access control.
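For example (the group names in the output are illustrative; yours will depend on what is defined in LDAP):

# Fetch groups for the current user, now resolved through LdapGroupsMapping
hdfs groups

# Fetch groups for a named user, e.g. the hdfs service account
hdfs groups hdfs
# Example output (illustrative): hdfs : hdfs hadoop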

Conclusion

Since there are two ways for Hadoop to use groups stored in LDAP, a basic question is when to use each. The OS-based group mapping is a Linux/Unix method and won’t work on Windows. The explicit group mapping covered in this post works on both Linux and Windows. However, the explicit group mapping is not sufficient for a Kerberized cluster.

Let me know if you run into any issues with the steps in this post or have any comments. In the next post I will cover configuring the OS to read group information from LDAP.

The post Hadoop GroupMapping – LDAP Integration appeared first on Hortonworks.

Apache Storm and Hadoop


In February 2014, the Apache Storm community released Storm version 0.9.1. Storm is a distributed, fault-tolerant, and high-performance real-time computation system that provides strong guarantees on the processing of data. Hortonworks is already supporting customers using this important project today.

Many organizations have already used Storm, including our partner Yahoo! This version of Apache Storm (version 0.9.1) is:

  • Highly scalable. Like Hadoop, Storm scales linearly
  • Fault-tolerant. Automatically reassigns tasks if a node fails
  • Reliable. Supports “at least once” and “exactly once” processing semantics
  • Language agnostic. Processing logic can be defined in any language (e.g. Ruby, Python, JavaScript, Perl), and
  • An Apache project. This brings with it the brand, governance and large community of the Apache Software Foundation.

Netty-based Messaging Transport

The biggest code change in version 0.9.1 was the removal of the 0MQ transport in favor of a pure java Netty-based transport. Special thanks to the engineering team at Yahoo! for contributing that.

Previously, installing the 0MQ native binaries proved difficult for many users.  The pure-java solution cures that headache. Netty also improves Storm’s performance over 0MQ, allowing twice as many messages per second through the same cluster.

All this being said, the 0MQ transport is still an available and supported option for those who want to use it.
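For reference, the transport implementation is pluggable and selected in storm.yaml. The snippet below is a sketch for a 0.9.x cluster; the buffer size value is illustrative, and additional storm.messaging.netty.* tuning keys are available.

# storm.yaml: select the pure-Java Netty transport
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
# Illustrative tuning value; adjust for your workload
storm.messaging.netty.buffer_size: 5242880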

Windows Platform Support

This is the first release of Storm with built-in Windows support. This is an important step for those who have invested in a Windows-based infrastructure and want to use Storm for real-time, stream processing.

Hortonworks Data Platform is the only Hadoop distribution that supports Windows. Now that Storm is part of HDP, it will also run on Windows.

Apache Maven for Storm Builds

From a developer perspective, we migrated from using Leiningen as our build tool to using Apache Maven. This was the right thing to do for release management. Maven had more options when it came to integrating Storm’s build process with the ASF release infrastructure.

Now we’re in a much better position to release early and often.

Coming Next: Security, Multi-tenancy and Storm-On-YARN

Now that we have our first Apache release out, we’re in a better position to work on what matters most to our users: improving Storm and adding new features.

A focus for upcoming releases will be security and multi-tenancy. The engineering team at Yahoo! has contributed a tremendous amount of work in that regard, and we’ll be looking to get those features added to the main codebase.

There is also a lot of interest in support for running Storm on YARN. Again, Yahoo! has done a lot of work in this area, and has open-sourced a preliminary implementation of Storm on YARN.

Storm Comes to the Apache Software Foundation (ASF)

This is Storm’s first release from the Apache Software Foundation. The ASF ensures that released software adheres to a stringent set of licensing and distribution rules that protect both the users of the software and the contributing developers.

Thanks to the Team

Many people worked hard to bring Storm into the ASF and to release version 0.9.1. Thanks to the following folks who made this release possible: Andy Feng; David Lao; Derek Dagit; Flip Kromer; James Xu; Jason Jackson; Nathan Marz and Robert Evans.

DOWNLOADS: http://storm.incubator.apache.org/downloads.html

RELEASE NOTES: https://git-wip-us.apache.org/repos/asf?p=incubator-storm.git;a=blob_plain;f=CHANGELOG.md;hb=254ec135b9a67b1e7bc8e979356274aee2e7d715

The post Apache Storm and Hadoop appeared first on Hortonworks.

Hortonworks Announces $100M Funding to innovate Enterprise Hadoop and drive Ecosystem Expansion


We are excited to welcome Blackrock and Passport Capital as Hortonworks investors who today led a $100M round of funding with continued participation from all existing investors.

This latest round of funding will allow us to double-down on our founding strategy: to make open source Apache Hadoop a true enterprise data platform. To that end we are focused in two areas:

1. Lead the innovation of Hadoop. In open source, for everyone.

Since our inception, we have been focused on leading the delivery of Enterprise Hadoop as a full data platform completely in open source.  For example, a foundational element of this was delivered by Arun Murthy and team with the YARN-based architecture of Hadoop 2 which moved Hadoop beyond its MapReduce heritage into a true multi-purpose data platform.

Thanks to the contributions of so many, Hadoop continues to expand even further to address critical enterprise requirements such as Security, Operations and Data Governance and to enable Hadoop to take its place in a Modern Data Architecture.

[Diagram: the Enterprise Hadoop blueprint, showing Presentation & Applications; Enterprise Management & Security; Governance & Integration (data workflow, lifecycle & governance); Data Access (batch, script, SQL, NoSQL, stream, search, in-memory); Data Management (HDFS distributed file system); Security (authentication, authorization, accounting & data protection); Operations (provision, manage & monitor; scheduling); Deployment Choice (Linux & Windows; on-premise or cloud/hosted).]

A recent post on this blog highlighted the unique commitment of major vendors in the IT ecosystem who have partnered with us in this approach.

The vision for the value of Hadoop is variously described as data lakes, data hubs, even data reservoirs… But whatever term you may choose to apply, we are committed to delivering the enabling technology as completely open source and wholly integrated with existing enterprise systems. The complete Apache Hadoop platform, incorporating data management, data access, data governance, data security and data operations, will be forever and always defined in open source and defended by a strong ecosystem that contributes deeply to its success.

2. Extend and enable the ecosystem. A modern data architecture with Hadoop.

We are also focused on enabling the broader ecosystem of IT vendors to incorporate Hadoop into their applications.  In addition to hundreds of partners that we work with daily, we have deep engineering and go-to-market commitments with data center leaders including Microsoft, Red Hat, Teradata, SAP, HP and others that are foundational to our approach.

With this funding we will further invest to enable the go-to-market with these key partners and continue to build the roster of HDP certified applications.

Business Momentum

As validation of our approach to innovating the technology in open source and enabling a broad ecosystem, we have seen tremendous interest in the Hortonworks Data Platform: we have added 250+ customers in the past year, and more than 75,000 downloads of our Sandbox learning environment in just the past 10 months.  This funding will enable us to accelerate our investments to expand the reach of our organization to service our growing and increasingly global customer base.  There is absolutely no doubt interest in Hadoop is growing around the globe, as evidenced by next week’s Hadoop Summit Europe which has completely sold out – surpassing even our most ambitious expectations.

We are very pleased to welcome these two amazing investors to our team and even more excited about the opportunity ahead for all of us.

The post Hortonworks Announces $100M Funding to innovate Enterprise Hadoop and drive Ecosystem Expansion appeared first on Hortonworks.

Introduction to Apache Falcon: Data Governance for Hadoop


Apache Falcon is a data governance engine that defines, schedules, and monitors data management policies. Falcon allows Hadoop administrators to centrally define their data pipelines, and then Falcon uses those definitions to auto-generate workflows in Apache Oozie.

InMobi is one of the largest Hadoop users in the world, and their team began the project 2 years ago. At the time, InMobi was processing billions of ad-server events in Hadoop every day. The InMobi team started the project to meet their need for policies to manage how that data flowed into and through their cluster—specifically for replication, lifecycle management, and data lineage and traceability.

This became the Apache Falcon incubator project in April 2013. Since it became an Apache incubator project, three Falcon releases have resolved nearly 400 JIRAs with contributions from many individuals in the open source community.

Apache Falcon will ship with Hortonworks Data Platform version 2.1. This is the first time that Falcon will be included in a top-tier Hadoop distribution.

Although Falcon is a newer addition to the Hadoop ecosystem, it has already been running in production at InMobi for nearly two years. InMobi uses it for various processing pipelines and data management functions, including SLA feedback pipelines and revenue pipelines.

What does Apache Falcon do?

Apache Falcon simplifies complicated data management workflows into generalized entity definitions. Falcon makes it far easier to:

  • Define data pipelines
  • Monitor data pipelines in coordination with Ambari, and
  • Trace pipelines for dependencies, tagging, audits and lineage.

This architecture diagram gives a high-level view of how Falcon interacts with the Hadoop cluster to define, monitor and trace pipelines:

[Figure: high-level Falcon architecture]

Apache Oozie is Hadoop’s workflow scheduler, but mature Hadoop clusters can have hundreds to thousands of Oozie coordinator jobs. At that level of complexity, it becomes difficult to manage so many data set and process definitions.

This results in some common mistakes. Processes might use the wrong copies of data sets. Data sets and processes may be duplicated, and it becomes increasingly difficult to track down where a particular data set originated.

Falcon addresses these data governance challenges with high-level and reusable “entities” that can be defined once and re-used many times. Data management policies are defined in Falcon entities and manifested as Oozie workflows, as shown in the following diagram.

[Figure: Falcon entity definitions manifested as Oozie workflows]

Apache Falcon defines three types of entities that can be combined to describe all data management policies and pipelines. These entities are:

  • Cluster
  • Feed (i.e. Data Set)
  • Process

[Figure: the three Falcon entity types (Cluster, Feed, Process)]

Getting Started with Apache Falcon

To better illustrate how Apache Falcon manages data, let’s take a look at a few hands-on examples in the Hortonworks Sandbox. For these examples, we will assume that you have a Falcon server running.

(See the “Hortonworks Technical Preview for Apache Falcon” for instructions on Falcon server installation.)

Example #1: A Simple Pipeline

Let’s start with a simple data pipeline. We want to take a data set that comes in every hour, process it with a pig script, and then land the output for further use with other processes.

[Figure: the simple pipeline (hourly raw input feed → Pig filter process → daily filtered output feed)]

To orchestrate this simple process with Falcon, we need to define the required entities in Falcon’s XML format. The XML definition is very straightforward and it helps us better understand how the entity is defined.

Cluster Entity

First we define a Cluster entity. This includes all of the service interfaces used by Falcon. The Feed and Process entities (to follow) both depend on the Cluster definition.

<?xml version="1.0" encoding="UTF-8"?>
<cluster colo="toronto" description="Primary Cluster"
         name="primary-cluster"
         xmlns="uri:falcon:cluster:0.1">
    <tags>class=production,site=canada-east</tags>
    <interfaces>    
        <interface type="readonly" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.4.0"/>
        <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.4.0"/>
        <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.4.0"/>
        <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.1.0"/> 
        <interface type="registry" endpoint="thrift://sandbox.hortonworks.com:9083" version="0.13.0"/>
        <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.4.3"/>
    </interfaces>
    <locations>
        <location name="staging" path="/tmp"/>
        <location name="working" path="/tmp"/>
        <location name="temp" path="/tmp"/>
    </locations>
</cluster>

Submit this Cluster Entity to Falcon:

falcon entity -type cluster -submit -file primary-cluster.xml

Feed Entities

The “raw input feed” and “filtered feed” must also be defined. They are both considered Feed entities in Falcon.

For both of the Feed entities, we define the following key parameters:

  • Feed Frequency. How often this feed will land. This should coincide with the parameterized input directories (“${YEAR}…”). In our example, the raw-input-feed has data landing every hour and the filtered-feed aggregates the data into a daily file.
  • Cluster. The cluster(s) where this feed will land. The primary-cluster entity is cited by name. Multiple clusters can be specified for replication policies.
  • Retention Policy. How long the data will remain on the cluster. This will create a job that consistently deletes files older than 90 days.
  • Data Set Locations. The HDFS path to the location of the files. This example has a parameterized path, allowing the flexibility to define one Feed entity for sets of data that are from the same source but “partitioned” by folder as they land.

Create the raw-input-feed.xml:

<?xml version="1.0" encoding="UTF-8"?>
<feed description="raw input feed" name="raw-input-feed" xmlns="uri:falcon:feed:0.1">
    <tags>owner=landing,pipeline=adtech,category=click-logs</tags>
    <frequency>minutes(60)</frequency>
    <clusters>
        <cluster name="primary-cluster" type="source">
            <validity start="2014-03-01T00:00Z" end="2016-01-01T00:00Z"/>
            <retention limit="days(90)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/landing/raw-input-feed/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>   
    <ACL owner="adam" group="landing" permission="0x755"/>
	<schema location="/none" provider="none" />
</feed>

Create the filtered-feed.xml:

<?xml version="1.0" encoding="UTF-8"?>
<feed description="filtered feed" name="filtered-feed" xmlns="uri:falcon:feed:0.1">
    <tags>owner=landing,pipeline=adtech,category=click-logs</tags>
    <frequency>days(1)</frequency>
    <clusters>
        <cluster name="primary-cluster" type="source">
            <validity start="2014-03-01T00:00Z" end="2016-01-01T00:00Z"/>
            <retention limit="months(36)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/landing/filtered-feed/${YEAR}-${MONTH}-${DAY}"/>
    </locations>   
    <ACL owner="bizuser" group="analytics" permission="0x755"/>
	<schema location="/none" provider="none" />
</feed>

Submit these entities:

falcon entity -type feed -submit -file raw-input-feed.xml
falcon entity -type feed -submit -file filtered-feed.xml

Schedule the Feeds:

falcon entity -type feed -schedule -name raw-input-feed
falcon entity -type feed -schedule -name filtered-feed

Unlike the Cluster definition, the Feed entities have policies attached to them that need to be explicitly scheduled by Falcon. Falcon takes the retention, replication, feed frequency, and delays and creates Oozie Coordinator jobs to automate all of these actions for you.

Note: To submit the entities you may have to specify the correct proxy user groups in core-site.xml. You can set these to the explicit groups for your environment, or set them to a value of “*” to grant everyone user proxy rights. Here are the two relevant properties (an illustrative snippet follows below):

hadoop.proxyuser.oozie.groups
hadoop.proxyuser.falcon.groups
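A sketch of what that looks like in core-site.xml, using the permissive “*” value. The matching hadoop.proxyuser.*.hosts properties shown here are an assumption (they are commonly set alongside the groups properties); both should be restricted in production.

<!-- Illustrative only: grants proxy rights broadly; tighten for production -->
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.falcon.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.falcon.hosts</name>
  <value>*</value>
</property>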

Process Entity

The Process entity in Falcon is the high-level abstraction of any data processing task in your pipeline. Falcon can define the following built-in actions:

  • Any Oozie workflow (which includes many actions)
  • A Hive script
  • A Pig script

The actions can be sent parameterized values so you can use one script to work on many different Falcon-defined processes. This helps reduce code complexity and makes it easier to manage multiple workflows.

While you can call complicated Oozie workflows with one Falcon job, we recommend that you split up the complex Oozie workflows into modular steps. This helps Falcon better manage retention of any intermediary data sets. It also allows for reuse of Processes and Feeds.

The following process defines a Pig Script as the execution engine. In this process, we:

  1. Take the data from the HDFS file specified in “raw-input-feed”,
  2. Call the “simplefilter.pig” script to filter a specific subset of the data, and
  3. Land the output data in the feed specified in “filtered-feed”.

The frequency of the input files is every hour and we aggregate that into a daily, filtered data set.

For the Process entity, we define the following key parameters:

  • Clusters – the site or sites where the process is executed
  • Frequency – the frequency of the job execution. This usually corresponds to the named output feed frequency.
  • Inputs – We specify the name of the Feed entity (“raw-input-feed”) and the time range of the data to use in the process.
  • Outputs – We specify the name of the Feed entity (“filtered-feed”) for output and the time that this output will correspond to.  This results in a single file for all of the 1 hour input files in the input Feed.
  • Retry – We can control the number of retry attempts and the delay between each attempt.

Create filter-process.xml:

<?xml version="1.0" encoding="UTF-8"?>
<process name="filter-process" xmlns="uri:falcon:process:0.1">
    <tags>owner=landing,pipeline=adtech,category=click-logs</tags>
    <clusters>
        <cluster name="primary-cluster">
            <validity start="2014-03-02T00:00Z" end="2016-04-30T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>days(1)</frequency>
    <inputs>
        <input name="input" feed="raw-input-feed" start="yesterday(0,0)" end="today(-1,0)" />
    </inputs>
    <outputs>
        <output name="output" feed="filtered-feed" instance="yesterday(0,0)" />
    </outputs>
    <workflow engine="pig" path="/landing/scripts/simplefilter.pig" />
    <retry policy="periodic" delay="minutes(30)" attempts="2" />
</process>

Example pig script used (simplefilter.pig):

A = LOAD '$input' using PigStorage(',') AS (id:chararray, value:chararray, errcode:int);
B = FILTER A BY (errcode == 200);
STORE B INTO '$output' USING PigStorage('|');

Submit and Schedule the process:

falcon entity -type process -submit -file filter-process.xml
falcon entity -type process -schedule -name filter-process

Note that the Expression Language functions used in the Process entity (today(), yesterday()) are defined in the Falcon documentation.

Once we schedule our Feeds and Processes we can take a look at the Oozie GUI to see the jobs automatically generated. We will see a job for each retention policy.

[Figure: Oozie coordinator jobs generated by Falcon for the scheduled Feeds and Process]
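If you prefer the command line to the Oozie web UI, something like the following lists the generated coordinators. The Oozie URL is taken from the workflow interface declared in the Cluster entity above; adjust it for your environment.

# List the coordinator jobs that Falcon generated
oozie jobs -oozie http://sandbox.hortonworks.com:11000/oozie -jobtype coordinator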

Example #2: Late Data Arrival

[Figure: the pipeline from Example #1 extended with a late data arrival policy]

A late data arrival policy can be added to the Process entity. This allows for an alternate workflow to handle input data that does not arrive within the defined times.

A late cut-off time can be supplied on the Feed entities. This uses the Falcon Expression Language to specify the amount of time before a Feed is considered late. We have set this to 10 minutes in our “raw-input-feed”.

Edit raw-input-feed.xml:

<?xml version="1.0" encoding="UTF-8"?>
<feed description="raw input feed" name="raw-input-feed" xmlns="uri:falcon:feed:0.1">
    <tags>owner=landing,pipeline=adtech,category=click-logs</tags>
    <frequency>minutes(60)</frequency>
    <late-arrival cut-off="minutes(10)"/>
    <clusters>
        <cluster name="primary-cluster" type="source">
            <validity start="2014-03-01T00:00Z" end="2016-01-01T00:00Z"/>
            <retention limit="days(90)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/landing/raw-input-feed/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>   
    <ACL owner="adam" group="landing" permission="0x755"/>
	<schema location="/none" provider="none" />
</feed>

The late arrival policy on the Process entity can be specified with: time-delayed back-off, exponential back-off, or final (no delay or back-off). For our modified filter-process, we have specified a strict policy that uses an alternative workflow if a feed is later than its cut-off time (10 minutes, in the case of “raw-input-feed”).  The alternative workflow is defined as an Oozie action located in hdfs://landing/late/workflow (workflow not shown below).

Edit filter-process.xml:

<?xml version="1.0" encoding="UTF-8"?>
<process name="filter-process" xmlns="uri:falcon:process:0.1">
    <tags>owner=landing,pipeline=adtech,category=click-logs</tags>
    <clusters>
        <cluster name="primary-cluster">
            <validity start="2014-03-02T00:00Z" end="2016-04-30T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>days(1)</frequency>
    <inputs>
        <input name="input" feed="raw-input-feed" start="yesterday(0,0)" end="today(-1,0)" />
    </inputs>
    <outputs>
        <output name="output" feed="filtered-feed" instance="yesterday(0,0)" />
    </outputs>
    <workflow engine="pig" path="/landing/scripts/simplefilter.pig" />
    <retry policy="periodic" delay="minutes(30)" attempts="2" />
    <late-process policy="final" delay="minutes(1)">
        <late-input input="input" workflow-path="hdfs://landing/late/workflow"/>
    </late-process>
</process>

Update the relevant Entities:


falcon entity -type feed -name raw-input-feed -file raw-input-feed.xml --update
falcon entity -type process -name filter-process -file filter-process.xml --update

Note: We updated the active existing entities rather than deleting and re-scheduling them. For other Falcon CLI commands, please see the Apache Falcon documentation.

Example #3: Cross-Cluster Replication

[Figure: cross-cluster Feed replication from the primary to the secondary cluster]

In this final example, we will illustrate Cross-Cluster Replication with a Feed entity. This is a simple way to enforce Disaster Recovery policies or aggregate data from multiple clusters to a single cluster for enterprise reporting. To further illustrate Apache Falcon’s capabilities, we will use an HCatalog/Hive table as the Feed entity.

First, we need to create a second Cluster entity to replicate to. If you are running this as an example, you can start up another Hortonworks Sandbox virtual machine or simply point the second Cluster entity back to the Sandbox you were already using. If you choose to try this out with a separate cluster, specify that second cluster’s endpoints appropriately.

Create secondary-cluster.xml:

<?xml version="1.0" encoding="UTF-8"?>
<cluster colo="toronto" description="Secondary Cluster"
         name="secondary-cluster"
         xmlns="uri:falcon:cluster:0.1">
    <tags>class=production,site=canada-east</tags>
    <interfaces>    
        <interface type="readonly" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.4.0"/>
        <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.4.0"/>
        <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.4.0"/>
        <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.1.0"/> 
        <interface type="registry" endpoint="thrift://sandbox.hortonworks.com:9083" version="0.13.0"/>
        <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.4.3"/>
    </interfaces>
    <locations>
        <location name="staging" path="/tmp"/>
        <location name="working" path="/tmp"/>
        <location name="temp" path="/tmp"/>
    </locations>
</cluster>

Submit Cluster entity:

falcon entity -type cluster -submit -file secondary-cluster.xml

After we have a secondary cluster to replicate to, we can define the database and tables on both the primary and secondary clusters.  This is a prerequisite for defining a Table Feed entity; Falcon will validate the existence of the tables before allowing them to be defined in a Feed.

Create database and tables in Apache Hive:

-- Run on primary cluster
create database landing_db;
use landing_db;
CREATE TABLE summary_table(id int, value string) PARTITIONED BY (ds string);
ALTER TABLE summary_table ADD PARTITION (ds = '2014-01');
ALTER TABLE summary_table ADD PARTITION (ds = '2014-02');
ALTER TABLE summary_table ADD PARTITION (ds = '2014-03');
 
-- Run on secondary cluster 
create database archive_db;
use archive_db;
CREATE TABLE summary_archive_table(id int, value string) PARTITIONED BY (ds string);

With the tables created, we can define the Feed entity. Two clusters are defined as a source and target for replication. The tables are referred to by a URI that denotes their database and their table name. This Feed could later be used with Falcon processes for Hive and Pig to create larger workflows.

In this example, there is no process required because the replication policy is defined on the Feed itself.

Finally, the retention policy would remove partitions past the retention limit set on each cluster. This last example Feed entity demonstrates the following:

  • Cross-cluster replication of a Data Set
  • The native use of a Hive/HCatalog table in Falcon
  • The definition of a separate retention policy for the source and target tables in replication

Create replication-feed.xml:

<?xml version="1.0" encoding="UTF-8"?>
<feed description="Monthly Analytics Summary" name="replication-feed"
      xmlns="uri:falcon:feed:0.1">
    <tags>owner=landing,pipeline=adtech,category=monthly-report</tags>
    <frequency>months(1)</frequency>
    <clusters>
        <cluster name="primary-cluster" type="source">
            <validity start="2014-03-01T00:00Z" end="2016-01-31T00:00Z"/>
            <retention limit="months(36)" action="delete"/>
        </cluster>
        <cluster name="secondary-cluster" type="target">
            <validity start="2014-03-01T00:00Z" end="2016-01-31T00:00Z"/>
            <retention limit="months(180)" action="delete"/>
            <table uri="catalog:archive_db:summary_archive_table#ds=${YEAR}-${MONTH}" />
        </cluster>
    </clusters>
    <table uri="catalog:landing_db:summary_table#ds=${YEAR}-${MONTH}" />
    <ACL owner="etluser" group="landing" permission="0755"/>
    <schema location="hcat" provider="hcat"/>
</feed>

Submit and Schedule Feed entity:

falcon entity -type feed -submit -file replication-feed.xml
falcon entity -type feed -schedule -name replication-feed

What’s Next?

The Apache Falcon project is expanding rapidly. This blog has only covered a portion of its capabilities.  We plan future blogs to cover:

  • Distributed Cluster Management with Apache Falcon Prism
  • Management UI
  • Data Lineage
  • Data Management Alerting and Monitoring
  • Falcon Integration and Extensibility via the REST API
  • Secure Data Management

Additional Information

The post Introduction to Apache Falcon: Data Governance for Hadoop appeared first on Hortonworks.

Advancing Enterprise Hadoop with Hortonworks Data Platform 2.1


The pace of innovation within the Apache Hadoop community is truly remarkable, enabling us to announce the availability of Hortonworks Data Platform 2.1, incorporating the very latest innovations from the Hadoop community in an integrated, tested, and completely open enterprise data platform.

Download HDP 2.1 Technical Preview Now

What’s In Hortonworks Data Platform 2.1?

[Diagram: the Enterprise Hadoop blueprint, showing Presentation & Applications; Enterprise Management & Security; Governance & Integration (data workflow, lifecycle & governance); Data Access (script, SQL, NoSQL, stream, search, in-memory); Data Management (HDFS distributed file system); Security (authentication, authorization, accounting & data protection); Operations (provision, manage & monitor; scheduling); Deployment Choice (Linux & Windows; on-premise or cloud/hosted).]

The advancements in HDP 2.1 span every aspect of Enterprise Hadoop: from data management, data access, integration & governance, security and operations.  All of this is delivered via Apache Software Foundation projects. While there are many enhancements to all projects, below are just a few key highlights of HDP 2.1.

The Stinger Initiative: Apache Hive, Tez and YARN for Interactive Query

HDP delivers on the commitments made last year with the final phase of the Stinger Initiative, a concerted effort to improve the performance of Apache Hive and SQL query in Hadoop. Apache Hive is already the most widely used data access engine for Hadoop, and for good reason: it also has the broadest commitment of community development.

On top of the innovations in YARN and Apache Tez, 145 developers across 45 unique companies (Microsoft, SAP, Facebook, Hortonworks to name just a few) have contributed over 390,000 lines of code to Apache Hive.

The result? Hadoop users and developers now have native interactive SQL query at petabyte scale in Apache Hive.

Data Governance with Apache Falcon

HDP 2.1 includes Apache Falcon, an open source project that delivers a reliable, repeatable and simple framework for managing the flow of data in and out of Hadoop. This control structure, along with a set of tooling to ease and automate the application of schema or metadata on sources, is critical for the successful integration of Hadoop into your modern data architecture.

For an introduction to Data Governance with Apache Falcon, take a look at our recent blog post.

Security with Apache Knox

Security is addressed in Hadoop across multiple layers, and this release adds numerous new security advances; the most notable include ACLs for HDFS and Grant/Revoke functions for Apache Hive. However, the largest security advancement is the addition of Apache Knox.

Apache Knox provides perimeter security through a single point of authentication/access for your cluster and integrates with your existing Active Directory or LDAP implementations.  Again, Knox is an example of a key technology being contributed to open source by a major ecosystem vendor and is indicative of the community force behind the delivery of the Enterprise Hadoop platform completely in open source.

Stream Processing with Apache Storm

Stream processing has emerged as a key use case for Hadoop, and as a result we have been supporting Apache Storm for stream processing at dozens of our customers. As announced when we initiated the work several months ago, we are now including Storm as a native component of the Hortonworks Data Platform.

Searching Hadoop Data with Apache Solr

Another key use case for Hadoop is Search, and we are extremely pleased to announce that we are adding support for Apache Solr in HDP 2.1 to enable native search functionality.  Apache Solr extends Hadoop with a powerful user interface for advanced search applications that unlocks a range of use cases focused on user search across very large data sets stored in Hadoop.

Advanced Operations with Apache Ambari

HDP 2.1 includes the very latest version of Apache Ambari which supports new platform services including Storm, Falcon, and Tez, provides extensibility and rolling restarts, as well as other significant operational improvements.

This is clearly a major milestone for the Hadoop community and a significant release of the Hortonworks Data Platform. A full list of capabilities can be found in the release notes.

Availability

We have made available a single VM download of HDP 2.1 so that users can get started today, while a complete version of the product for both Linux and Windows will be available later in April.

The post Advancing Enterprise Hadoop with Hortonworks Data Platform 2.1 appeared first on Hortonworks.


Tutorials for Hadoop with HDP 2.1: Hive, Tez, Falcon, Knox, Storm


If you’re excited to get started with the new features in Hortonworks Data Platform 2.1, then we’ve included 4 tutorials for you to try out – Sandbox-style.

You can download the HDP 2.1 Technical Preview here, and then get stuck into these great tutorials.

Interactive Query with Apache Hive and Apache Tez

OK, so you’re not going to get huge performance out of a one-node VM, but you can try out Hive on Tez, and see the performance gains versus MapReduce, and also try out features such as Vectorized Query, and the host of new SQL features. Get supercharged here.

Defining and Processing Data Pipelines with Apache Falcon

Sometimes, it’s not all about speed. Sometimes you want surety and governance on the data movements across the cluster. In this tutorial, we simulate a dataset movement from one cluster to another and perform cleansing as we do that. Define your pipeline here.

Processing Stream data in near real-time with Apache Storm

But then who am I kidding? Of course it’s all about speed. In this case, speed of response to incoming stream data. This tutorial sets up Apache Storm to read and react to incoming sentences. Process your streams here.

Secure your Hadoop infrastructure with Apache Knox

With data flying around in all directions, it’s probably worth taking a look at Apache Knox to provide perimeter security for your cluster – even if it is just one node. Batten down the hatches here.

We hope you have some fun testing out the new features of HDP 2.1 with these tutorials, and that they provide the inspiration for your own production work. If you have any comments, let us know below, or in the forums. And if you’d like a Hortonworks elephant, be sure to add your own tutorial over here.

The post Tutorials for Hadoop with HDP 2.1: Hive, Tez, Falcon, Knox, Storm appeared first on Hortonworks.

Bringing Enterprise Search to Enterprise Hadoop


Today we are proud to announce the formation of a terrific partnership with LucidWorks to bring search to the Hortonworks Data Platform. LucidWorks delivers an enterprise-grade search development platform built atop the power of Apache Solr.

[Diagram: the Enterprise Hadoop blueprint, showing Presentation & Applications; Enterprise Management & Security; Governance & Integration (data workflow, lifecycle & governance); Data Access (script, SQL, NoSQL, stream, search, in-memory); Data Management (HDFS distributed file system); Security (authentication, authorization, accounting & data protection); Operations (provision, manage & monitor; scheduling); Deployment Choice (Linux & Windows; on-premise or cloud/hosted).]

Shared Vision and New Scenarios

Both LucidWorks and Hortonworks have a shared vision of innovating in open source and delivering it to customers in an enterprise grade platform.

As part of our continuing mission to build a completely open, versatile enterprise data platform spanning many data processing scenarios, Solr offers a simple yet powerful interface providing advanced search capabilities.

The massive interest in and adoption of Hadoop means that there are increasing numbers of users of Hadoop-based data. The combination of Apache Solr and the Hortonworks Data Platform (HDP) lets non-technical users access and analyze big data in a way that nearly everyone is already familiar with: search.

Under the terms of the partnership, Hortonworks will include Solr in HDP, and Hortonworks customers will receive support and technology from both Hortonworks and LucidWorks. The partnership has several components:

  • Hortonworks now provides Apache Solr as a component of the Hortonworks Data Platform 2.1.
  • Hortonworks and LucidWorks will jointly provide enterprise support for Solr.
  • We will go to market together to market and sell search to the enterprise.

You can read more about the announcement here.

The post Bringing Enterprise Search to Enterprise Hadoop appeared first on Hortonworks.

HDFS ACLs: Fine-Grained Permissions for HDFS Files in Hadoop


Securing any system requires you to implement layers of protection.  Access Control Lists (ACLs) are typically applied to data to restrict access to approved entities.  Applying ACLs at every layer of access to data is critical to securing a system.  The layers for Hadoop are depicted in this diagram, and in this post we will cover the lowest level of access: ACLs for HDFS.

This is part of the HDFS Developer Trail series.  Other posts in this series include:

Background

For several years, HDFS has supported a permission model equivalent to traditional Unix permission bits [5].  For each file or directory, permissions are managed for a set of 3 distinct user classes: owner, group, and others.  There are 3 different permissions controlled for each user class: read, write, and execute.  When a user attempts to access a file system object, HDFS enforces permissions according to the most specific user class applicable to that user.  If the user is the owner, then HDFS checks the owner class permissions.  If the user is not the owner, but is a member of the file system object’s group, then HDFS checks the group class permissions.  Otherwise, HDFS checks the others class permissions.

This model is sufficient to express a large number of security requirements at the file and directory level.  For example, consider a sales department that wants a single user, the department manager, to control all modifications to sales data.  Other members of the department need to view the data, but must not be able to modify it.  Everyone else in the company outside the sales department must not be able to view the data.  This requirement can be implemented by running chmod 640 on the file, with the following outcome:

-rw-r-----   3 bruce sales          0 2014-03-04 16:31 /sales-data

Only bruce may modify the file, only members of the sales group may read the file, and no one else may access the file in any way.

Now suppose there are new requirements.  The sales department has grown such that it’s no longer feasible for the manager, bruce, to control all modifications to the file.  Instead, the new requirement is that bruce, diana, and clark are allowed to make modifications.  Unfortunately, there is no way for permission bits to express this requirement, because there can be only one owner and one group, and the group is already used to implement the read-only requirement for the sales team.  A typical workaround is to set the file owner to a synthetic user account, such as salesmgr, and allow bruce, diana, and clark to use the salesmgr account via sudo or similar impersonation mechanisms.

Also suppose that in addition to the sales staff, all executives in the company need to be able to read the sales data.  This is another requirement that cannot be expressed with permission bits, because there is only one group, and it’s already used by sales.  A typical workaround is to set the file’s group to a new synthetic group, such as salesandexecs, and add all users of sales and all users of execs to that group.

Both of these workarounds incur significant drawbacks, though.  They force complexity onto cluster administrators, who must manage additional users and groups.  They also force complexity onto end users, who must use different accounts for different actions.

ACLs Applied!

In general, plain Unix permissions aren’t sufficient when you have permission requirements that don’t map cleanly to an enterprise’s natural hierarchy of users and groups.  Working in collaboration with the Apache community, we developed the HDFS ACLs feature to address this shortcoming.  HDFS ACLs will be available in Apache Hadoop 2.4.0 and Hortonworks Data Platform 2.1.

HDFS ACLs give you the ability to specify fine-grained file permissions for specific named users or named groups, not just the file’s owner and group.  HDFS ACLs are modeled after POSIX ACLs [4].  If you’ve ever used POSIX ACLs on a Linux file system, then you already know how ACLs work in HDFS.  Best practice is to rely on traditional permission bits to implement most permission requirements, and define a smaller number of ACLs to augment the permission bits with a few exceptional rules.

To use ACLs, first you’ll need to enable ACLs on the NameNode by adding the following configuration property to hdfs-site.xml and restarting the NameNode.

<property>
<name>dfs.namenode.acls.enabled</name>
<value>true</value>
</property>

Most users will interact with ACLs using 2 new commands added to the HDFS CLI: setfacl and getfacl.  Let’s look at several examples of how HDFS ACLs can help implement complex security requirements.

Example 1: Granting Access to Another Named Group

Going back to our original example, let’s set an ACL that grants read access to sales-data for members of the execs group.

  • Set the ACL.

> hdfs dfs -setfacl -m group:execs:r-- /sales-data

  • Check results by running getfacl.

> hdfs dfs -getfacl /sales-data
# file: /sales-data
# owner: bruce
# group: sales
user::rw-
group::r--
group:execs:r--
mask::r--
other::---

  • Additionally, the output of ls has been modified to append ‘+’ to the permissions of a file or directory that has an ACL.

> hdfs dfs -ls /sales-data
Found 1 items
-rw-r-----+  3 bruce sales          0 2014-03-04 16:31 /sales-data

The new ACL entry is added to the existing permissions defined by the permission bits.  User bruce has full control as the file owner.  Members of either the sales group or the execs group have read access.  All others have no access.
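
The same ACL change can also be made programmatically.  Below is a minimal sketch using the FileSystem ACL methods introduced alongside the shell commands (modifyAclEntries and getAclStatus); the class and method names reflect our reading of the Hadoop 2.4 API and are worth double-checking against the release Javadocs.

// Minimal sketch: granting the execs group read access programmatically.
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

public class GrantExecsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path salesData = new Path("/sales-data");

    // Equivalent of: hdfs dfs -setfacl -m group:execs:r-- /sales-data
    AclEntry execsRead = new AclEntry.Builder()
        .setScope(AclEntryScope.ACCESS)
        .setType(AclEntryType.GROUP)
        .setName("execs")
        .setPermission(FsAction.READ)
        .build();
    fs.modifyAclEntries(salesData, Collections.singletonList(execsRead));

    // Equivalent of: hdfs dfs -getfacl /sales-data
    System.out.println(fs.getAclStatus(salesData));
  }
}

The modifyAclEntries call merges the new entry into the existing ACL, mirroring the -m flag of setfacl.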

Example 2: Using a Default ACL for Automatic Application to New Children

In addition to an ACL enforced during permission checks, there is also a separate concept of a default ACL.  A default ACL may be applied only to a directory, not a file.  Default ACLs have no direct effect on permission checks and instead define the ACL that newly created child files and directories receive automatically.

Suppose we have a monthly-sales-data directory, further sub-divided into separate directories for each month.  Let’s set a default ACL to guarantee that members of the execs group automatically get access to new sub-directories, as they get created for each month.

  • Set default ACL on parent directory.

> hdfs dfs -setfacl -m default:group:execs:r-x /monthly-sales-data

  • Make sub-directories.

> hdfs dfs -mkdir /monthly-sales-data/JAN
> hdfs dfs -mkdir /monthly-sales-data/FEB

  • Verify HDFS has automatically applied default ACL to sub-directories.

> hdfs dfs -getfacl -R /monthly-sales-data
# file: /monthly-sales-data
# owner: bruce
# group: sales
user::rwx
group::r-x
other::---
default:user::rwx
default:group::r-x
default:group:execs:r-x
default:mask::r-x
default:other::---

# file: /monthly-sales-data/FEB
# owner: bruce
# group: sales
user::rwx
group::r-x
group:execs:r-x
mask::r-x
other::---
default:user::rwx
default:group::r-x
default:group:execs:r-x
default:mask::r-x
default:other::---

# file: /monthly-sales-data/JAN
# owner: bruce
# group: sales
user::rwx
group::r-x
group:execs:r-x
mask::r-x
other::---
default:user::rwx
default:group::r-x
default:group:execs:r-x
default:mask::r-x
default:other::---

The default ACL is copied from the parent directory to the child file or child directory at time of creation.  Subsequent changes to the parent directory’s default ACL do not alter the ACLs of existing children.

Example 3: Blocking Access to a Sub-Tree for a Specific User

Suppose there is an emergency need to block access to an entire sub-tree for a specific user.  Applying a named user ACL entry to the root of that sub-tree is the fastest way to accomplish this without accidentally revoking permissions for other users.

  • Add ACL entry to block all access to monthly-sales-data by user diana.

> hdfs dfs -setfacl -m user:diana:--- /monthly-sales-data

  • Check results by running getfacl.

> hdfs dfs -getfacl /monthly-sales-data
# file: /monthly-sales-data
# owner: bruce
# group: sales
user::rwx
user:diana:---
group::r-x
mask::r-x
other::---
default:user::rwx
default:group::r-x
default:group:execs:r-x
default:mask::r-x
default:other::---

It’s important to keep in mind the order of evaluation for ACL entries when a user attempts to access a file system object:

  1. If the user is the file owner, then the owner permission bits are enforced.

  2. Else if the user has a named user ACL entry, then those permissions are enforced.

  3. Else if the user is a member of the file’s group or any named group in an ACL entry, then the union of permissions for all matching entries are enforced.  (The user may be a member of multiple groups.)

  4. If none of the above were applicable, then the other permission bits are enforced.

In this example, the named user ACL entry accomplished our goal, because the user is not the file owner, and the named user entry takes precedence over all other entries.
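
For readers who prefer code to prose, here is a simplified sketch of that evaluation order.  It is purely illustrative: the types and helper names are hypothetical, and the real permission checker additionally applies the ACL mask to named-user, named-group and group entries, which is omitted here.

import java.util.List;
import java.util.Set;
import org.apache.hadoop.fs.permission.FsAction;

class AclCheckSketch {

  // A named user or named group ACL entry, reduced to the bare minimum.
  static class NamedEntry {
    final boolean isUser;   // true = named user entry, false = named group entry
    final String name;
    final FsAction perm;
    NamedEntry(boolean isUser, String name, FsAction perm) {
      this.isUser = isUser;
      this.name = name;
      this.perm = perm;
    }
  }

  static boolean permits(String user, Set<String> userGroups,
                         String owner, String fileGroup,
                         FsAction ownerPerm, FsAction groupPerm, FsAction otherPerm,
                         List<NamedEntry> namedEntries, FsAction requested) {
    // 1. Owner: only the owner permission bits are consulted.
    if (user.equals(owner)) {
      return ownerPerm.implies(requested);
    }
    // 2. A named user entry takes precedence over any group entry.
    for (NamedEntry e : namedEntries) {
      if (e.isUser && e.name.equals(user)) {
        return e.perm.implies(requested);
      }
    }
    // 3. Union of the file group and all matching named group entries.
    FsAction union = FsAction.NONE;
    boolean groupMatched = false;
    if (userGroups.contains(fileGroup)) {
      union = union.or(groupPerm);
      groupMatched = true;
    }
    for (NamedEntry e : namedEntries) {
      if (!e.isUser && userGroups.contains(e.name)) {
        union = union.or(e.perm);
        groupMatched = true;
      }
    }
    if (groupMatched) {
      return union.implies(requested);
    }
    // 4. Otherwise fall through to the other permission bits.
    return otherPerm.implies(requested);
  }
}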

Development

This feature was addressed in issue HDFS-4685 [1].  Development and testing were a joint effort over several months among multiple active Apache Hadoop contributors.  The scope of the effort required coding across multiple layers of HDFS: new APIs, new shell commands, new file system metadata persisted in the NameNode and enhancements to permission enforcement logic.

During the initial planning, I expected our greatest challenge would be efficient storage management for the new metadata.  This is always an important consideration for new features in the NameNode, where file system metadata consumes precious RAM at runtime and long-term persistence consumes disk for the fsimage and edits.  One of our goals was that existing deployments that do not wish to use ACLs must not suffer increased RAM consumption after introduction of the feature.  An ACL is associated to an inode, and it would be unacceptable to introduce an O(n) (n = # inodes) increase in RAM consumption even if ACLs were not used.  This immediately ruled out the naive implementation of adding a new nullable field to the inode data structure.  (Even if set to null, memory is still consumed by the pointer.)  Early revisions of the design document proposed optimization techniques that involved repurposing unused bits in the permission data structure to act as an index into a shared ACL table.  I expected this would be tricky code requiring detailed edge case testing.

Fortunately, development of ACLs benefited from another change happening in the HDFS codebase simultaneously.  HDFS-5284 [2] provided support for inode features.  An inode feature is a generalized concept of optional attributes associated to a specific inode.  If a particular inode does not have a feature, then that feature does not consume additional RAM.  This was a perfect fit for ACLs!  Much of the early design was discarded and simplified by just defining a new AclFeature class and attaching it to the inode data structure on only the inodes that required it.  We’re also starting to use inode features as a building block for many other NameNode metadata needs, such as snapshots and quotas.
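
As a rough illustration of why this helps, the sketch below shows the general “optional feature” pattern: attributes live in a small per-inode collection, so inodes that never acquire an ACL pay no extra memory for an unused field.  The class names here are hypothetical and do not mirror the actual HDFS implementation.

import java.util.ArrayList;
import java.util.List;

class InodeSketch {
  interface Feature {}                      // marker for optional inode attributes

  // Hypothetical stand-in for an ACL feature attached only where needed.
  static class AclFeatureSketch implements Feature {
    final List<String> entries = new ArrayList<>();   // e.g. "group:execs:r--"
  }

  private final List<Feature> features = new ArrayList<>(0);

  <T extends Feature> T getFeature(Class<T> type) {
    for (Feature f : features) {
      if (type.isInstance(f)) {
        return type.cast(f);
      }
    }
    return null;                            // this inode has no such feature
  }

  void addFeature(Feature f) {
    features.add(f);
  }
}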

We also benefited from the work in HDFS-5698 [3], which converted the fsimage file from a custom binary format to a much more flexible format utilizing Protocol Buffers.  With that change in place, we no longer needed to write error-prone custom serialization and deserialization logic for the new ACL metadata.  Instead, we simply defined new protobuf messages to represent the new metadata.

After those 2 changes simplified the design, I was quite surprised to find that the real challenge of the project was correctly implementing the core logic for ACL manipulation and enforcement.  We wanted to match existing implementations on Linux as closely as possible to make the feature familiar and easy to use for system administrators.  We also wanted to make sure that ACLs would compose well with other existing HDFS features, like snapshots, the sticky bit and WebHDFS.  We added more than 200 new tests to HDFS and wrote a comprehensive system test plan to cover these scenarios.

This was a community effort.  I want to thank all of the Apache contributors who participated in design, coding, testing and review of the feature: Arpit Agarwal, Dilli Arumugam, Vinayakumar B, Sachin Jose, Renil Joseph, Brandon Li, Haohui Mai, Colin Patrick McCabe, Kevin Minder, Sanjay Radia, Suresh Srinivas, Tsz-Wo Nicholas Sze, Yesha Vora and Jing Zhao.

Have you ever considered contributing to HDFS?  We still need help on libHDFS API bindings for the new ACL APIs.  Patches welcome!

References

  1. HDFS-4685. Implementation of ACLs in HDFS.
  2. HDFS-5284. Flatten INode hierarchy.
  3. HDFS-5698. Use protobuf to serialize / deserialize FSImage.
  4. Gruenbacher, A. (2003). POSIX Access Control Lists on Linux.
  5. Wikipedia contributors (2013). File system permissions – Traditional Unix permissions.

The post HDFS ACLs: Fine-Grained Permissions for HDFS Files in Hadoop appeared first on Hortonworks.

Introducing Apache Tez 0.4


We are excited to announce that the Apache™ Tez community voted to release version 0.4 of the software.

Apache Tez is an alternative to MapReduce that provides a powerful framework for executing a complex topology of tasks for data access in Hadoop. Version 0.4 incorporates the feedback from extensive testing of Tez 0.3, released just last month.

This release is especially meaningful because it coincides with completion of the Stinger Initiative (a collaborative community effort involving 145 developers across 44 companies) and the upcoming release of Apache Hive 0.13.

Major community achievements in this Tez 0.4 release were:

  • Application Recovery – This is a major improvement to the Tez framework that preserves work when the job controller (YARN Tez Application Master) gets restarted due to node loss or cluster maintenance. When the Tez Application Master restarts, it will recover all the work that was already completed by the previous master. This is especially useful for long running jobs where restarting from scratch would waste work already completed.
  • Stability for Hive on Tez – We did considerable testing with the Apache Hive community to make sure the imminent release of Hive 0.13 is stable on Tez. We appreciate the great partnership.
  • Data Shuffle Improvements – Data shuffling re-partitions and re-distributes data across the cluster. This is a major operation in distributed data processing, so performance and stability are important. Tez 0.4 includes improvements in memory consumption, connection management, and in the handling of errors and empty partitions.
  • Windows Support – The community fixed bugs and made changes to Tez so that it runs as smoothly on Windows as it does on Linux. We hope this will encourage adoption of Tez on Windows-based systems.

We hope that Tez 0.4 provides a stable, reliable and high performance framework for wider community adoption. We encourage you to try out Apache Tez for your use cases. We look forward to hearing feedback and suggestions for improvements. We’re all ears!

Also, we would like to thank the wider Apache community for their support and cooperation.

– The Apache Tez Team

Download

The post Introducing Apache Tez 0.4 appeared first on Hortonworks.

Apache Hadoop 2.4.0 Released!


It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.4.0! Thank you to every single one of the contributors, reviewers and testers!

The community fixed 411 JIRAs for 2.4.0 (on top of the 511 JIRAs resolved for 2.3.0). Of the 411 fixes:

  • 50 are in Hadoop Common,
  • 171 are in HDFS,
  • 160 are in YARN and
  • 30 went into MapReduce

Hadoop 2.4.0 is the second Hadoop release in 2014, following Hadoop 2.3.0’s February release and its key enhancements to HDFS such as Support for Heterogeneous Storage and In-Memory Cache.

Hadoop 2.4.0 continues that momentum, with additional enhancements to both HDFS & YARN:

  • Support for Access Control Lists in HDFS (HDFS-4685)
  • Native support for Rolling Upgrades in HDFS (HDFS-5535)
  • Smooth operational upgrades with protocol buffers for HDFS FSImage (HDFS-5698)
  • Full HTTPS support for HDFS (HDFS-5305)
  • Support for Automatic Failover of the YARN ResourceManager (YARN-149) (a.k.a Phase 1 of YARN ResourceManager High Availability)
  • Enhanced support for new applications on YARN with Application History Server (YARN-321) and Application Timeline Server (YARN-1530)
  • Support for strong SLAs in YARN CapacityScheduler via Preemption (YARN-185)

HADOOP-1298 introduced file permissions in HDFS in Hadoop 0.16 (a blast from the past – this was in January 2008). Now, HDFS takes a major step forward with support for Access Control Lists (getfacl/setfacl). ACLs enhance the existing HDFS permission model to support controlling file access for arbitrary combinations of users and groups, instead of just the 3 predetermined classes: a single owner, a single group, and all other users.  Take a look at the HDFS ACLs design document for more details.

Hadoop clusters are growing, and some operations teams are challenged with upgrading as many as 5000 HDFS nodes, storing more than 100 petabytes of data. Rolling Upgrades make this significantly easier to manage. Switching the HDFS FSImage to use protocol-buffers also eases operations, since it allows safe HDFS upgrades to newer versions with better rollback capabilities (in face of software bugs or human errors).

Security is a key concern for Apache Hadoop and we are pleased that version 2.4.0 includes full HTTPS support for HDFS across all components: WebHDFS, HsFTP and even web-interfaces.

With automatic failover of the YARN ResourceManager, applications can smoothly failover to a (cold) standby ResourceManager in case of operational issues such as hardware failures. The new ResourceManager will automatically restart applications. In the next phase we plan to add a hot standby that can continue to run applications from the point of failover, to preserve any work already completed.

We are also seeing the community take advantage of YARN’s promise, with many diverse applications now implemented (or ported over) to run on YARN. From this, we have received important feedback that it would be useful for YARN to provide standard services to track and store application-specific metrics such as containers used and resources consumed.

So we are thrilled to note that YARN now provides better metrics capabilities with a generic Application Timeline Server (ATS). ATS uses a NoSQL store at the backend (which defaults to a single-node LevelDB, with HBase for scale-out) and provides extremely fast writes for millions of metrics and some key aggregation capabilities during retrieval. ATS also provides a very-simple REST interface to PUT & GET application timeline data.

ATS is already being used by key applications such as Apache Tez & Apache Hive to store query metrics and render GUIs on the client side, using JavaScript to present the JSON in a human-friendly manner in the browser!
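
For a feel of that REST interface, the sketch below issues a simple GET for a list of timeline entities.  It assumes a timeline server on localhost at the default web port 8188 and uses HIVE_QUERY_ID as an example entity type; both the address and the entity type are assumptions to adjust for your cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class TimelineGet {
  public static void main(String[] args) throws Exception {
    // GET /ws/v1/timeline/<entity-type> from the Application Timeline Server
    URL url = new URL("http://localhost:8188/ws/v1/timeline/HIVE_QUERY_ID");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/json");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);   // JSON list of timeline entities
      }
    } finally {
      conn.disconnect();
    }
  }
}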

Preemption in YARN CapacityScheduler had been available since Hadoop 2.2, but, to my knowledge, this was the first time that anyone had extensively validated the feature and it came out with flying colors! Many thanks to Carlo Curino & Chris Douglas, the original contributors.

Looking Ahead to Apache Hadoop 2.5.0

As always, the Apache Hadoop community is looking ahead, with our eyes on a number of enhancements to the core platform for Apache Hadoop 2.5. Here is a preview:

  • First-class support for rolling upgrades in YARN, with:
    • Work-preserving ResourceManager restart (YARN-556)
    • Container-preserving NodeManager restart (YARN-1336)
  • Support for admin-specified labels for servers in YARN for enhanced control and scheduling (YARN-796)
  • Support for applications to delegate resources to others in YARN. This will allow external services to share not just YARN’s resource-management capabilities but also its workload-management capabilities. (YARN-1488)
  • Support for automatically sharing application artifacts in the YARN distributed cache. (YARN-1492)

Acknowledgements

Many thanks to everyone who contributed to the release, and everyone in the Apache Hadoop community.

In particular I’d like to call out the following folks: Chris Nauroth, Haohui Mai & Vinaykumar B. for their work on HDFS ACLs; Haohui Mai for his work on using protobufs for FSImage; Tsz Wo Sze, Kihwal Lee, Arpit Agarwal, Brandon Li & Jing Zhao for their work on Rolling Upgrades for HDFS; Karthik Kambatla, Xuan Gong and Tsuyoshi Ozawa for their work on YARN ResourceManager automatic failover; Zhijie Shen, Mayank Bansal, Billie Rinaldi and Vinod K. V. for their work on YARN ATS/AHS and, again, several folks from Twitter such as Gera Shegalov, Lohit V., Joep R., Sangjin Lee et al for a number of unsung, but very key operational enhancements and bug-fixes to YARN. Last, but not least, a big shout-out to folks such as Ramya Sunil, Yesha Vora, Tassapol A., Arpit Gupta and others who helped validate the release and ensured that we, as a community, can continue to deliver very high quality releases of Apache Hadoop.

Links

The post Apache Hadoop 2.4.0 Released! appeared first on Hortonworks.
