User logs of Hadoop jobs serve multiple purposes. First and foremost, they can be used to debug issues while running a MapReduce application – correctness problems with the application itself, race conditions when running on a cluster, and task/job failures due to hardware or platform bugs. Secondly, one can do historical analyses of the logs to see how individual tasks in a job or workflow perform over time. One can even analyze the Hadoop MapReduce user-logs using Hadoop MapReduce(!) to determine any performance issues.
Handling of user-logs generated by applications has been one of the biggest pain-points for Hadoop installations in the past. In Hadoop 1.x, user-logs are left on individual nodes by the TaskTracker, and the management of these log files on local nodes is both insufficient for longer-term analyses and unreliable for user access. YARN tackles this log-management issue by having the NodeManagers (NMs) provide the option of moving these logs securely onto a distributed file-system (FS), for example HDFS, after the application completes.
Motivation
In Hadoop 1.x releases, MapReduce is the only programming model available to users. Each MapReduce job runs a number of mappers and reducers. Every map/reduce task generates logs directly on the local disk – syslog/stderr/stdout and optionally profile.out/debug.out etc. These task logs are accessible through the web UI, which is the only convenient way to view them – logging into individual nodes to inspect logs is both cumbersome and sometimes not possible due to access restrictions.
State of the log-management art in Hadoop 1.x
For handling the users’ task-logs, the TaskTrackers (TTs) make use of a UserLogManager, which is composed of the following components.
- UserLogCleaner: Cleans the logs of a given job after retaining them for a certain time starting from job-finish. The configuration property mapred.userlog.retain.hours dictates this duration and defaults to one day. Every job’s user-log directory is added to a cleanup thread, which wakes up periodically and deletes any old logs that have hit their retention interval.
- On top of that, all logs in ${mapred.local.dir}/userlogs are added to the UserLogCleaner once a TT restarts.
- TaskLogsTruncater: This component truncates every log file beyond a certain length after a task finishes. The configuration properties mapreduce.cluster.map.userlog.retain-size and mapreduce.cluster.reduce.userlog.retain-size control how much of the user-log is retained. The assumption here is that when something goes wrong with a task, the tail of the log generally indicates the problem, so we can afford to lose the head of the log once it grows beyond a certain size.
- mapred.userlog.limit.kb is another solution that predates the above: while a task JVM is running, its stdout and stderr streams are piped through the unix tail program, which retains only the specified amount of log and writes it to the stdout/stderr files.
There were a few more efforts that didn’t make it to production:
- Log collection: a user-invokable LogCollector that runs on client/gateway machines and collects per-job logs into a compact format.
- Killing running tasks whose logs exceed N GB, because otherwise a run-away task can fill up the disk holding the logs and cause downtime.
Problems with existing log management:
- Truncation: Users complain about truncated logs more often than not. A few users need access to the complete logs. No limit really satisfies all users – 100KB works for some, but not for others.
- Run-away tasks: TaskTrackers/DataNodes can still run out of disk space due to excessive logging, because truncation only happens after tasks finish.
- Insufficient retention: Users complain about the log-retention time. No admin-configured limit satisfies all users – the default of one day works for many, but many others gripe that they cannot do any post-analysis.
- Access: Serving logs over HTTP from the TTs is completely dependent on the job-finish time and the retention time – not perfectly reliable.
- Collection status: A log-collector tool has the same reliability problem – one needs to build more tooling to detect when log-collection starts and finishes, and when to switch over to the collected logs from the usual files managed by the TTs.
- mapred.userlog.limit.kb increases the memory usage of each task (especially with lots of logging) and doesn’t work with the more generic application containers supported by YARN.
- Historical analysis: All of these logs are served over HTTP only as long as they exist on the local node – users who want to do post-analysis have to employ some log-collection mechanism and build more tooling around it.
- Load-balancing & SPOF: All the logs are written to a single log-directory – there is no load balancing across disks, and if that one disk goes down, all jobs lose all their logs.
Admitting that the existing solutions were really only stop-gaps for a more fundamental problem, we took the opportunity to do the right thing by enabling in-platform log aggregation.
Log-aggregation in YARN: Details
So, instead of truncating user-logs and leaving them on individual nodes for a certain time as the TaskTracker does, the NodeManager addresses the log-management issue by providing the option to move these logs securely onto a file-system (FS), for example HDFS, after the application completes.
- Logs for all the containers that belong to a single application and ran on a given NM are aggregated and written out to a single (possibly compressed) log file at a configured location in the FS.
- In the current implementation, once an application finishes, one will have:
  - an application-level log-directory, and
  - a per-node log-file that consists of the logs for all the containers of the application that ran on that node (see the example layout after this list).
- Users have access to these logs via YARN command-line tools, the web UI, or directly from the FS – no longer restricted to a web-only interface.
- These logs can potentially be stored for much longer than was possible in 1.x, given that they are stored on a distributed file-system.
- We don’t need to truncate logs to very small lengths – as long as the log sizes are reasonable, we can afford to store them in their entirety.
- In addition to that, while the containers are running, the logs are now written to multiple directories on each node for effective load balancing and improved fault tolerance.
- AggregatedLogDeletionService is a service that periodically deletes aggregated logs. Today it runs only inside the MapReduce JobHistoryServer.
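To make the layout concrete, here is a rough sketch of what the aggregated logs could look like on HDFS with the default settings described later in this post (/tmp/logs as the remote root and "logs" as the per-user suffix). The user name, application ID, node names, timestamps and file sizes below are made up purely for illustration:

```
$ hadoop fs -ls /tmp/logs/alice/logs/application_1381845770119_0001
Found 2 items
-rw-r-----   3 alice hadoop   65284 2013-10-15 12:01 /tmp/logs/alice/logs/application_1381845770119_0001/node1.example.com_45454
-rw-r-----   3 alice hadoop   12837 2013-10-15 12:02 /tmp/logs/alice/logs/application_1381845770119_0001/node2.example.com_45454
```

Each of the two files aggregates the logs of every container of this application that ran on the corresponding node.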
Usage
Web UI
On the web interfaces, the fact that logs are aggregated is completely hidden from the user.
- While a MapReduce application is running, users can see the logs from the ApplicationMaster UI, which redirects to the NodeManager UI.
- Once an application finishes, the completed job’s information is owned by the MapReduce JobHistoryServer, which again serves the user-logs transparently.
- For non-MapReduce applications, we are working on a generic ApplicationHistoryServer that does the same thing.
Command line access
In addition to the web-UI, we also have a command line utility to interact with logs.
```
$ $HADOOP_YARN_HOME/bin/yarn logs
Retrieve logs for completed YARN applications.
usage: yarn logs -applicationId <application ID> [OPTIONS]

general options are:
 -appOwner <Application Owner>    AppOwner (assumed to be current user if
                                  not specified)
 -containerId <Container ID>      ContainerId (must be specified if node
                                  address is specified)
 -nodeAddress <Node Address>      NodeAddress in the format nodename:port
                                  (must be specified if container id is
                                  specified)
```
So, to print all the logs for a given application, one can simply say
```
yarn logs -applicationId <application ID>
```
On the other hand, if one wants to get the logs of only one container, the following works
```
yarn logs -applicationId <application ID> -containerId <Container ID> -nodeAddress <Node Address>
```
The obvious advantage of the command-line utility is that you can now use regular shell utilities like grep, sort, etc. to filter out any specific information that you are looking for in the logs!
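For example, one could look for the most frequent ERROR lines across every container of an application in a single pipeline; the application ID below is hypothetical:

```
# Fetch all aggregated logs of the application and summarize its ERROR lines
$ yarn logs -applicationId application_1381845770119_0001 \
    | grep "ERROR" | sort | uniq -c | sort -rn | head
```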
Administration
General log related configuration properties
- yarn.nodemanager.log-dirs: Determines where the container-logs are stored on the node when the containers are running. Default is ${yarn.log.dir}/userlogs.
- An application’s localized log directory will be found in ${yarn.nodemanager.log-dirs}/application_${appid}.
- Individual containers’ log directories will be below this, in directories named container_${containerId}.
- For MapReduce applications, each container directory will contain the files stderr, stdout, and syslog generated by that container.
- Other frameworks can choose to write more or fewer files; YARN doesn’t dictate the file names or the number of files.
- yarn.log-aggregation-enable: Whether to enable log aggregation or not. If disabled, NMs will keep the logs locally (like in 1.x) and not aggregate them.
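As a minimal sketch, turning on log aggregation and spreading the container-log directories over two local disks (the /grid/... paths are hypothetical) could look like the following entries inside the <configuration> element of yarn-site.xml:

```xml
<!-- Illustrative values only -->
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <!-- Multiple local directories give load balancing and fault tolerance
       while containers are running; these paths are made up. -->
  <value>/grid/0/yarn/container-logs,/grid/1/yarn/container-logs</value>
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
```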
Properties respected when log-aggregation is enabled
- yarn.nodemanager.remote-app-log-dir: This is on the default file-system, usually HDFS, and indicates where the NMs should aggregate logs to. It should not be on the local file-system, otherwise serving daemons like the history-server will not be able to serve the aggregated logs. Default is /tmp/logs.
- yarn.nodemanager.remote-app-log-dir-suffix: The remote log directory will be created at ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Default value is “logs”.
- yarn.log-aggregation.retain-seconds: How long to wait before deleting aggregated logs; -1 or any negative number disables deletion. One needs to be careful not to set this to too small a value so as not to burden the distributed file-system.
- yarn.log-aggregation.retain-check-interval-seconds: Determines how long to wait between aggregated-log retention checks. If it is set to 0 or a negative value, the interval is computed as one-tenth of the aggregated-log retention time. As with the previous property, one needs to be careful not to set this too low. Defaults to -1.
- yarn.log.server.url: Once an application is done, NMs redirect web-UI users to this URL, where the aggregated logs are served. Today it points to the MapReduce-specific JobHistoryServer.
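Putting the above together, a hedged example of the aggregation-related settings in yarn-site.xml might look like the following; the 30-day retention and the history-server host are illustrative choices, not recommendations:

```xml
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <!-- Default; must live on the shared FS (e.g. HDFS), not the local FS -->
  <value>/tmp/logs</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
  <!-- Logs end up under /tmp/logs/${user}/logs/... -->
  <value>logs</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <!-- 30 days; -1 would disable deletion of aggregated logs -->
  <value>2592000</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-check-interval-seconds</name>
  <!-- Default: a non-positive value means one-tenth of the retention time -->
  <value>-1</value>
</property>
<property>
  <name>yarn.log.server.url</name>
  <!-- Hypothetical JobHistoryServer host -->
  <value>http://historyserver.example.com:19888/jobhistory/logs</value>
</property>
```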
Properties respected when log-aggregation is disabled
- yarn.nodemanager.log.retain-seconds: Time in seconds to retain user logs on the individual nodes if log aggregation is disabled. Default is 10800.
- yarn.nodemanager.log.deletion-threads-count: Determines the number of threads used by the NodeManagers to clean up local log files once the log-retention time is hit, when aggregation is disabled.
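For completeness, a sketch of these two properties in yarn-site.xml when aggregation stays disabled; the thread count of 4 is a hypothetical value, not a recommendation:

```xml
<property>
  <name>yarn.nodemanager.log.retain-seconds</name>
  <!-- 10800 seconds = 3 hours, the documented default -->
  <value>10800</value>
</property>
<property>
  <name>yarn.nodemanager.log.deletion-threads-count</name>
  <!-- Hypothetical value; controls how many deletion threads the NM uses -->
  <value>4</value>
</property>
```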
Other setup instructions
- The remote root log directory is expected to have permissions 1777, owned by ${NMUser} and group-owned by ${NMGroup}, the group to which ${NMUser} belongs.
- Each application-level directory will be created with permission 770, user-owned by the application-submitter and group-owned by ${NMGroup}. This way the application-submitter can access his or her own aggregated logs, while ${NMUser} can access or modify the files for log management.
- ${NMGroup} should be a limited access group so that there are no access leaks.
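As a sketch of the one-time setup, assuming the default remote root of /tmp/logs and placeholder values of yarn for ${NMUser} and hadoop for ${NMGroup}:

```
# Run as the HDFS superuser; "yarn" and "hadoop" are placeholders
# for ${NMUser} and ${NMGroup} on your cluster.
$ hadoop fs -mkdir -p /tmp/logs
$ hadoop fs -chown yarn:hadoop /tmp/logs
$ hadoop fs -chmod 1777 /tmp/logs
```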
Conclusion
In this post, I’ve described the motivations for implementing log aggregation and how it looks to end users as well as administrators. Log aggregation has proven to be a very useful feature so far. There are interesting design decisions that we made and some unsolved challenges with the existing implementation – topics for a future post.