3 Reasons to try Stinger Phase 3 Technical Preview

Whether you were busy finishing up last minute Christmas shopping or just taking time off for the holidays, you might have missed that Hortonworks released the Stinger Phase 3 Technical Preview back in December. The Stinger Initiative is Hortonworks’ open roadmap to making Hive 100x faster while adding standard SQL. Here we’ll discuss 3 great reasons to give Stinger Phase 3 Preview a try to start off the new year.

Reason 1: It’s The Fastest Hive Yet

Whether you want to process more data or lower your time-to-insight, the benefits of a faster Hive speak for themselves. Stinger Phase 3 brings 3 key new components into Hive that lead to a massive speed boost.

The three components and what each contributes:

  • Tez: A modern implementation of Map/Reduce. New Tez operators simplify and accelerate complex data processing natively in Hadoop.
  • Tez Service: Maintains warm containers and caches key information to allow fast query launch.
  • Vectorized Query: A new execution engine that takes advantage of modern hardware architectures to accelerate computation over in-memory data by up to 100x.
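
To make this concrete, here is a minimal sketch of how a client session might switch these pieces on through the HiveServer2 JDBC driver. The endpoint, table and query are hypothetical; the two session properties are the Hive settings that select the Tez engine and enable vectorized execution.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StingerPreviewQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint; adjust host, port and database for your cluster.
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Run this session on Tez instead of classic MapReduce.
            stmt.execute("SET hive.execution.engine=tez");
            // Turn on vectorized execution (works over ORC-backed tables).
            stmt.execute("SET hive.vectorized.execution.enabled=true");
            // A hypothetical reporting query over an ORC table.
            ResultSet rs = stmt.executeQuery(
                "SELECT store_id, SUM(sales) FROM store_sales_orc GROUP BY store_id");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}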

What does it add up to? To find out we compared Stinger Phase 3 Preview head-to-head against Hive 12 on the same hardware and over the same dataset.

[Chart: query times for the Stinger Phase 3 Preview versus Hive 12 across the benchmark suite]

In this broad-based benchmark, which includes both large reporting-style queries and more targeted drill-down queries, the Stinger Technical Preview shows an average 2.7x speedup versus Hive 12. Remember that Hive 12 includes all the performance benefits that went into Stinger Phases 1 and 2, and is the fastest Hive generally available today.

We also did some limited comparisons between Hive on Tez and Hive 10. Hive 10 pre-dates the Stinger initiative and its focus on improving Hive performance.

[Chart: query times for Hive on Tez versus Hive 10 on a subset of queries]

In this limited subset of queries we see speedups ranging from 5x to 40x going from Hive 10 to Hive on Tez.

Configuration Details

Hardware:
  • 20 physical nodes, each with:
    • 4x 2.3GHz Xeon E5-2630 for a total of 24 cores per node
    • 64GB RAM
    • 6x 1TB drives
    • 1 Gigabit interconnect between the nodes

Software:
  • Hadoop 2.3.0-SNAPSHOT
  • Tez 0.2
  • Hive 0.13 snapshot taken from the Stinger Technical Preview
  • Hive 0.12 taken from Hortonworks HDP 2.0 GA
  • Hive 0.10 built manually against Hadoop 0.23 (the GA HDP package is not compatible with Hadoop 2)
  • Configuration settings for Hive 12 and Hive 13 were the same as those found in the Stinger Preview Quickstart. Hive 10 settings were those found in Hortonworks HDP 1.2.

Data:
  • TPC-DS Scale 200 data, partitioned by day
  • Hive 12 and the Stinger Preview were run against data stored in ORCFile using all default settings
  • Hive 10 was run against text data because it doesn’t support ORCFile

Queries:
  • Queries were those published in the Hive Testbench, run as they appear in the testbench and not individually tuned. The data generator we use is also included in the Hive Testbench.

Reason 2: Hive is now Interactive

Stinger Phase 3 Preview introduces the Tez Service, a persistent service that runs as a YARN Application Master. The Tez Service’s job is to facilitate fast query launch, and it does this in two ways. First, it keeps warm containers on standby so queries can start executing without waiting for container allocation.

Second, the Tez Service caches key information such as split calculations. Any time data in Hadoop is processed, maps are assigned to splits of files on the filesystem in order to divide-and-conquer the work. This involves querying the NameNode to identify where the data is physically located, which can take several seconds for large datasets. Because the Tez Service caches this information, subsequent queries over the same data launch much faster.


Let’s take a look at a few examples of how the Tez Service helps.

Query     Tez Cold (s)   Tez Warm (s)   Time saved by Tez Service (s)
query27   24.3           8.8            15.5
query79   80.9           45.2           35.8

Some example speedups using Tez Service.

Query 27 is a simple star-schema join involving one fact table and many dimension tables. When Tez Service has cached data and has warm containers, time to execute falls by more than 50% to under 10 seconds, which many people regard as the bar for “interactive query”.

Query 79 is a more complex fact-to-fact join that addresses much more data. Because more data is addressed, caching benefits the query even more, saving more than 30 seconds.

[Chart: cold versus warm query times with the Tez Service]

In the results, warm queries ran an average of 17 seconds faster than cold queries. This is a big deal for smaller, interactive queries, because Hive can now answer them in less than 10 seconds over large datasets, enabling interactive query in Hadoop.

Reason 3: Hive is 100% Community Open Source

At Hortonworks we spend a lot of time talking about Hive but it’s important to remember that Hive is a community effort and represents the hard work of hundreds of individuals who either contribute privately or represent one of more than 10 companies that contribute to Hive. Through this collective effort, Hive is quickly becoming the most robust, mature and secure SQL solution for Hadoop. Apache Hive is the only SQL solution for Hadoop supported by every major Hadoop distribution. Choosing Hive means 100% Community Open Source and 0% lock-in.

Try It For Yourself

We hope you’ll try the Stinger Phase 3 Preview for yourself. All you need is an HDP 2.0 cluster or Sandbox. To get started, follow the instructions on the announcement blog post. As always, if you have questions or need help, head to the Hortonworks Forums for tips and advice.

Modern Telecom Architectures Built with Hadoop

This is the third in our series on modern data architectures across industry verticals.

Many of the world’s largest telecommunications companies use Hortonworks Data Platform (HDP) to manage their data. Through partnership with these companies, we have learned how our customers use HDP to improve customer satisfaction, make better infrastructure investments and develop new products.

Hortonworks partner Teradata recently gave some use case examples in this video about how Verizon Wireless uses Teradata in combination with Hortonworks Data Platform to keep their customer churn below 1%.

Rob Smith, Verizon Wireless’ Executive Director for IT, describes how his team uses their discovery platform to improve customer interactions, by:

  • Identifying better ways to communicate with customers about their payments
  • Analyzing social media to better understand customer sentiment about Verizon policy changes
  • Tailoring marketing communications to each customer’s individual needs

Smith describes how this new customer insight helps their IT and Marketing teams align business objectives (for the benefit of Verizon customers).

Deliver a Modern Data Architecture with Hadoop

Our other telco clients have identified their own Hadoop use cases, but there are similar patterns in the Hadoop data architectures that they all build. Those data architectures allow telcos to store new types of data, retain that data longer, and join diverse datasets together to derive new insight.

The following reference architecture diagram represents an amalgam of those approaches that we see across our telco clients.

[Diagram: telecom reference architecture with Hortonworks Data Platform]

With their Hadoop modern data architectures, telecommunications companies of all sorts can execute the following six use cases (and many more).

Analyze Call Detail Records (CDRs)

Telcos perform forensics on dropped calls and poor sound quality, but call detail records flow in at a rate of millions per second. This high volume makes pattern recognition and root cause analysis difficult, and often those need to happen in real-time, while the customer waits on the line listening to hold music. Delay causes attrition and harms servicing margins.

Apache Flume can ingest millions of CDRs per second into Hadoop, while Apache Storm processes those in real-time and identifies any troubling patterns.
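
As a rough sketch of the Storm half of that pipeline, the bolt below counts dropped calls per tower and emits an alert tuple once a threshold is crossed. The tuple field names, the threshold and the upstream spout are illustrative assumptions rather than part of any particular telco deployment.

import java.util.HashMap;
import java.util.Map;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical bolt: counts dropped calls per tower and emits an alert tuple when a
// tower crosses a threshold. Assumes an upstream spout (for example, one fed by Flume)
// emits CDR tuples with "towerId" and "callStatus" fields.
public class DroppedCallBolt extends BaseBasicBolt {
    private static final int ALERT_THRESHOLD = 100;  // illustrative value, not a recommendation
    private final Map<String, Integer> droppedByTower = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple cdr, BasicOutputCollector collector) {
        if (!"DROPPED".equals(cdr.getStringByField("callStatus"))) {
            return;
        }
        String towerId = cdr.getStringByField("towerId");
        Integer count = droppedByTower.get(towerId);
        count = (count == null) ? 1 : count + 1;
        droppedByTower.put(towerId, count);
        if (count == ALERT_THRESHOLD) {
            // Downstream bolts could persist the alert to HBase or notify operations.
            collector.emit(new Values(towerId, count));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("towerId", "droppedCalls"));
    }
}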

HDP facilitates long-term data retention for root cause analysis, even years after the first issue. This CDR analysis can be used to continuously improve call quality, customer satisfaction and servicing margins.

Service Equipment Proactively

Transmission towers and their related connections form the spinal cord of a telecommunications network. Failure of a transmission tower can cause service degradation, and replacement of equipment is usually more expensive than repair. There exists an optimal schedule for maintenance: not too early, nor too late.

Apache Hadoop stores unstructured, streaming, sensor data from the network. Telcos can derive optimal maintenance schedules by comparing real-time information with historical data. Machine learning algorithms can reduce both maintenance costs and service disruptions by fixing equipment before it breaks.

Rationalize Infrastructure Investments

Telecom marketing and capacity planning are correlated. Consumption of bandwidth and services can be out of sync with plans for new towers and transmission lines. This mismatch between infrastructure investments and the actual return on investment puts revenue at risk.

Network log data helps telcos understand service consumption in a particular state, county or neighborhood. They can then analyze network loads more intelligently, with data stretching over longer periods of time. This allows executives to plan infrastructure investments with more precision and confidence.

Recommend Next Products to Buy (NPTB)

Telecom product portfolios are complex. Many cross-sell opportunities exist within the installed customer base, yet sales associates use in-person or phone conversations to guess at NPTB recommendations, with little data to support them.

HDP gives telco sales people the ability to make confident NPTB recommendations, based on data from all of its customers. Confident NPTB recommendations empower sales associates (or self service) and improve customer interactions.

A Hadoop data lake reduces sales friction and creates NPTB competitive advantage similar to Amazon’s advantages in eCommerce.

Allocate Bandwidth in Real-time

Certain applications hog bandwidth and can reduce service quality for others accessing the network. Network administrators cannot foresee the launch of new hyper-popular apps that cause spikes in bandwidth consumption and slow network performance. Operators must respond to bandwidth spikes quickly, to reallocate resources and maintain SLAs.

Streaming data in Hadoop helps network operators visualize spikes in call center data and nimbly throttle bandwidth. Text-based sentiment analysis on call center notes can also help understand how these spikes impact customer experience. This insight helps maintain service quality and customer satisfaction, and also informs strategic planning to build smarter networks.

Develop New Products

Mobile devices produce huge amounts of data about how, why, when and where they are used. This data is extremely valuable for product managers, but its volume and variety make it difficult to ingest, store and analyze at scale. Not all data is stored for conversion into business info. Even the data that is stored may not be retained for its entire useful life.

Apache Hadoop stores more data for longer, economically. This puts rich product-use data in the hands of product managers, which speeds product innovation. It can capture product insight specific to local geos and customer segments. Immediate big data feedback on product launches allows PMs to rescue failures and maximize blockbusters.

Find Out More

These are only six of the more common Hadoop use cases for telcos. There are many more that involve combining sensor and geo-location data with structured data stores already in place.

Find out more about a modern data architecture here.

Watch our blog in the coming weeks for reference architectures in other industries.

“Help, My Hadoop Doesn’t Work”

One aspect of community development of Apache Hadoop is the way that everyone working on Hadoop (full time, part time, vendors, users and even some researchers) collaborates in the open. This development is based on publicly accessible project tools: Apache Subversion for revision control, Apache Maven for the builds, and Jenkins for automating those builds and tests. Central to a lot of the work is the Apache JIRA server, an instance of Atlassian’s issue management tool.

If you are a Hadoop developer, you spend a lot of time with web browser tabs pointed at JIRA issues. As examples, I’m keeping an eye on YARN-896 and YARN-1489: new features being added to YARN to aid running long-lived applications in a Hadoop 2 cluster.

You also get issues filed by others ending up in your inbox by way of subscriptions to the Hadoop developer mailing lists: anyone has the right to create a JIRA account, file issue reports, and even supply patches to the source code.

Here’s a video I’ve made, and some slides, on how to do that – and in particular – how not to:

A theme I repeat in it is that JIRA is not a place to ask for help. If you have a support subscription with Hortonworks, you should report problems via our support portal, as this lets us track the problem and escalate as need be; any issue which does need a fix in Hadoop’s code will have a public JIRA filed against it and a patch developed in the open. There are also our community forums for discussing HDP-specific issues.

Others will have a similar stance, even more so if their Big Data stacks include closed-source components such as filesystems, job schedulers or management tools. Issues in closed source components would – naturally – have to be taken up directly with the vendor.

The underlying Apache projects do welcome public filing of bug reports – provided they are about real bugs in the applications, and if they come with enough information to make it possible to identify root causes. They also welcome people supplying fixes to those bugs – patches containing source code, including tests. That’s a topic I plan to cover in another video.

Speed, Scale and SQL: The Stinger Initiative, Apache Hive 12 & Apache Tez

I recently sat down with Owen O’Malley and Carter Shanklin to discuss the dramatic improvements delivered by the Stinger Initiative to version 0.12 of Apache Hive, which is well on its way to being 100x faster than pre-Stinger versions of Hive. That means interactive queries on petabytes of data.

Owen is one of the original architects of Apache Hadoop and Carter is the Hortonworks product manager focused on Hive. Together, they explain the speed, scale and SQL semantics delivered in Apache Hive v0.12, which is included in Hortonworks Data Platform v2.0. You can also find a technical preview of Hive 13 on our Labs page.

There’s also a little bit of Apache Hadoop YARN woven in.

Highlights include:

  • Basic definitions for Apache Hive, Apache Tez, the ORCFile format, predicate pushdown, vectorization and the Stinger Initiative
  • Discussion of new features in Hive 12
  • Addition of the VARCHAR and DATE data types (see the sketch after this list)
  • Preview of Hive 13 and phase three of Stinger
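
As a small illustration of where the new types fit, here is a minimal sketch that creates a table over JDBC with VARCHAR and DATE columns. The connection details and table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class Hive12TypesExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // VARCHAR carries a maximum length; DATE stores year/month/day with no time component.
            stmt.execute("CREATE TABLE IF NOT EXISTS customers ("
                + " id BIGINT,"
                + " name VARCHAR(100),"
                + " signup_date DATE)"
                + " STORED AS ORC");
        }
    }
}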

Visit our Stinger Initiative labs page to learn more.

Update on Stinger: the view from a Microsoft Committer

This guest post is from Eric Hanson, Principal Software Development Engineer on Microsoft HDInsight and Apache Hive committer.

Hive has a substantial community of developers behind it, including a few from the Microsoft HDInsight team. We’ve been contributing to the Stinger initiative since it was started early in 2013, and have been contributing to Hadoop since October of 2011. It’s a good time to step back and see the progress that’s been made on Apache Hive since fall of 2012, and ponder what’s ahead.

Hive has a lot going for it with respect to both functionality and scalability. The external table model of Hive, input adaptors for many file formats, and the on-by-default UDF support and large base of Java code that can be applied in UDFs make it very attractive for data transformation applications. For non-traditional analysis, the ability to embed custom Java mappers and reducers inside Hive SQL queries is also quite useful. Hive’s SQL language coverage has expanded to include much of SQL-92 and some SQL-99 OLAP extensions. And it scales to thousands of nodes because of its integration with Hadoop and HDFS. But it’s been criticized for being slow – more specifically for having a slow inner loop that used to process rows on the order of 100X slower than a state-of-the-art query executor. Hive has been a favorite whipping boy when it comes to performance. Look around and it’s not hard to find statements like “Our <database-or-big-data-system-name> can run SQL queries <number> times faster than Hive.”

This is changing. Over the last 15 months or so, the following big things have happened with Hive to improve performance:

  • ORC: ORC is a high-quality columnstore that can compress data around 10 times or more. It has the columnstore virtues people have come to expect, such as good compression and the need to only read columns your query touches.
  • Vectorized query execution: Hive now supports vectorized query execution. This technology can reduce CPU costs in the inner loop of query execution by 10X or more. It was popularized by the MonetDB/X100 project and has made its way into the top-performing data warehouse (DW) DBMSs, including Microsoft SQL Server and PDW. Vectorized query execution is a once-in-a-generation technological breakthrough. Any DW DBMS (proprietary or open source, integrated with Big Data or not) that does not have it is not in the game. You wouldn’t sell an airliner with piston engines today would you?
  • Tez: Hive on Tez reduces data communication costs between nodes and query phases tremendously by allowing more general data flows and reducing spooling to disk.
  • Container re-use: This allows processes to be re-used in the same query or user session, reducing start-up overhead. It’s a little like multi-threading. I’ve seen results that allow queries to finish in under 10 seconds, down from over 20 seconds, with this turned on, on a 20 node cluster.

What this means is that you need to verify statements of the form “<systemname> is <X> times faster than Hive” carefully because the code in the Hive trunk today is an order of magnitude faster (sometimes more) than it was 15 months ago. Here’s an example from Hortonworks. The left bar is Hive 10, the middle bar is Hive 11 with ORC, and the right is the latest Hive trunk. These results are at scale factor 20 (approximately 200GB of data).


[Chart: query time for Hive 10, Hive 11 with ORC, and the current Hive trunk at scale factor 20]

As you can see, for this query, Hive has moved from the “I’ll go for coffee while I run this query” stage to the “I don’t mind waiting for my answer” stage.

Even with this progress, Hive still has room for improvement. The biggest things it’s missing from a query execution performance perspective are:

  • the ability to run tasks with low startup overhead using threads rather than heavy-weight processes
  • an in-memory data cache or buffer pool to reduce or eliminate I/O

Hive is already attractive because of its functionality, ability to scale, established community and user base, and open source distribution. When the enhancements of the last 15 months get into production, its performance on a per-node basis won’t be too bad. Add in light weight scheduling and in-memory caching, and it can be downright good. Then Hive will be poised to grab the whip away and hit back.

HDP 2.0 for Windows is GA

We are excited to announce that the Hortonworks Data Platform 2.0 for Windows is publicly available for download. HDP 2 for Windows is the only Apache Hadoop 2.0 based platform that is certified for production usage on Windows Server 2008 R2 and Windows Server 2012 R2.

With this release, the latest in community innovation on Apache Hadoop is now available across all major Operating Systems. HDP 2.0 provides Hadoop coverage for more than 99% of the enterprises in the world, offering the most flexible deployment options from On-Premise to a variety of cloud solutions.

Unleashing YARN and Hadoop 2 on Windows

HDP 2.0 for Windows is a leap forward as it brings the power of Apache Hadoop YARN to Windows. YARN enables a user to interact with all data in multiple ways simultaneously – for instance making use of both realtime and batch processing – making Hadoop a true multi-use data platform and allowing it to take its place in a modern data architecture.

Windows data centers can now depend on a Highly Available NameNode to automatically detect and recover from any hardware, operating system or JVM faults and deliver reliable access to data to all HDP processing components.

Building on top of the next generation architecture of core Apache Hadoop, HDP 2.0 for Windows also provides:

  • Phase 2 of the Stinger initiative – providing a significant increase in SQL semantics, adding the VARCHAR and DATE datatypes and improving the performance of ORDER BY and GROUP BY.
  • Apache HBase 0.96 – delivering important enterprise features such as Snapshots and improved mean time to recovery (MTTR)
  • The latest stable releases of the rest of the HDP stack, integrated and certified to run on top of YARN

Upgrading from HDP 1.3?

As you move to HDP 2.0 for Windows, we’ve ensured that you have an upgrade path from HDP 1.3 for Windows clusters. You can upgrade your cluster in place, maintaining your data as well as ensuring that your existing Hive, Pig and MapReduce jobs will run on HDP 2.0 for Windows.

New to Hadoop and HDP?

For those looking to evaluate HDP 2.0 for Windows, we provide an easy-to-use single-node install through a graphical MSI interface. This is a great way to explore all the new features and build new applications for Hadoop.

You can also use the Hortonworks Sandbox along with Microsoft Power BI through a new tutorial for the Hortonworks Sandbox.

Microsoft’s Commitment to Apache Hadoop

Microsoft has a deep engineering relationship with Hortonworks, as we share the same objective: to ensure Hadoop is the best enterprise data platform it can be. Read about our work with Microsoft in the Hadoop open source community here.

Additionally you can hear how Hadoop and Microsoft fit into a modern data architecture (MDA) through this webinar.

You can join Microsoft, Hortonworks and Elastacloud at an upcoming hackathon on Feb 8th and 9th at the Microsoft Campus, Mountain View, CA.

Get started today! Download here.

Hadoop Summit Europe 2014 Agenda Announced

Today, we are excited to announce the agenda for Hadoop Summit Europe 2014. We welcome you to check it out and hopefully start planning your trip to Amsterdam now!

The call for abstracts for Hadoop Summit Europe was open for just over two months and we received an unbelievable 354 submissions. Wow! Further, as we read through them, the quality was amazing. We quickly surmised that the show was going to be great, but the selection process was going to be rough.

For each of our five tracks we recruited an amazing chair (listed below; blog post forthcoming) whose job was to curate the content. While the task was daunting, we are delighted with the sessions they and their committees chose.

Our tracks for Hadoop Summit Europe 2014 include the following:

Committer Track (led by Steve Loughran)

Come hear it from the elephant’s mouth… This track presents technical deep dive content from committers across a wide range of advanced/basic topics and projects. Speakers in this track are restricted to committers across all Hadoop-related Apache projects only and content will be curated by a group of senior committers.

The Future of Apache Hadoop (led by Evert Lammerts)

The next generation of Hadoop is being built today.  This track investigates the key projects, such as YARN, the incubation projects and the industry initiatives driving innovation in and around the Hadoop platform. Attendees will hear from the technical leads, committers, and expert users who are actively driving the roadmaps, key features, and advanced technology research around what is coming next for the Hadoop ecosystem.

Data Science & Hadoop (led by Edd Dumbill)

Sessions in this track focus on the practice of data science using Hadoop. This includes applications, tools, and algorithms as well as areas of advanced research and emerging applications that use and extend the Hadoop platform for data science. Sessions will cover examples of innovative analytic applications and systems that refine raw data into actionable insight using visualization, statistics and machine learning. Case studies and tips for effective exploration of business data with visualization and statistical models will also be covered.

Hadoop Deployment & Operations (led by Cedric Carbone)

This track focuses on the deployment, operation and administration of Hadoop clusters at scale, with an emphasis on tips, tricks, best practices and war stories. Sessions will cover the full deployment lifecycle: installation, configuration, initial production deployment, error recovery, security and fault tolerance for large-scale operations.

Hadoop for Business Applications and Development (led by Gary Richardson)

Hadoop is central to the modern data architecture and must integrate with the tools and applications you already use or empower new applications to be built on top of it.  Speakers in this track will discuss languages, tools, techniques, and solutions for deriving business value and competitive advantage from the large volumes of data flowing through today’s enterprise.

Register before Jan 31st to take advantage of the Early Bird rates.

The Windows Explorer experience for HDFS

This guest post is from Simon Elliston Ball, Head of Big Data at Red Gate and all-round top bloke.

Hadoop is a great place to keep a lot of data. The data lake, the data hub, the data platform: it’s all about the data. So how do you manage that data? How do you get data in? How do you get results out? How do you get at the logs buried somewhere deep in HDFS?

At Red Gate we have been working on some query tools for Hadoop for a while, and while testing we found ourselves endlessly typing hadoop fs. Getting data sets from our Windows desktops to the cluster, or inspecting job output files, was just taking too many steps. It should be as easy to access files on HDFS as files on a local drive. So we created HDFS Explorer, which works just like Windows Explorer, but connects to the WebHDFS APIs so we can browse files on our clusters.
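
Because WebHDFS is just HTTP, any client can issue the same kinds of calls that a tool like HDFS Explorer makes. Here is a minimal sketch of listing a directory; the NameNode host and path are assumptions, and 50070 is simply the usual default WebHDFS port in Hadoop 2.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsList {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode host; LISTSTATUS returns the directory listing as JSON.
        URL url = new URL("http://namenode.example.com:50070/webhdfs/v1/user/hadoop?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // JSON FileStatuses payload
            }
        } finally {
            conn.disconnect();
        }
    }
}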

HDFS Explorer helps if you’re shunting smaller data sets, or results to and from your desktop, but we also found it worked great for clearing up after all those test queries. Every job submission you send will leave a trail of meta-data, logs, job files, and output, which can quickly add up to a decent amount of disk. If you’re experimenting with a sandbox implementation this can be a real issue. On a proper cluster, even if disk space is not a problem, the mess left behind can make it hard to get to the right job’s diagnostics.

This video shows how to get up and running with HDFS Explorer and Hortonworks Sandbox so you can manage files and even clean up after yourself.

From its beginning as a humble little test tool, we’ve found HDFS Explorer has opened up the Hadoop File System, and made it much easier for us to implement proper data file management. This week we’re also happy to announce that we’ve added Kerberos support, making it ready for use on clusters in an Enterprise Authentication environment.

HDFS Explorer is available for free from the Red Gate Big Data site.

Download the Hortonworks Sandbox and get started with these great tutorials.

HBase 0.96: HBase on Windows, and Improvements to MTTR

I recently sat down with Devaraj Das and Carter Shanklin to discuss the dramatic improvements delivered in Apache HBase version 0.96 included in HDP 2.0.

Now HBase runs on Windows and (whether on Linux or Windows) it recovers from failures much more quickly, with dramatic improvements in mean time to recovery (MTTR).

Devaraj is one of the original architects of Apache Hadoop and Carter is the Hortonworks product manager focused on HBase. Together, they explain their collaboration with Microsoft to bring HBase to HDP 2.0 for Windows.

Other highlights include:

  • A basic definition of HBase and description of its architecture
  • Enterprise use cases for HBase
  • New data snapshot capabilities (see the sketch after this list)
  • Compaction improvements for better performance
  • First class data types for ease in development
  • Wire compatibility for future HBase versions
  • A glimpse into the HBase roadmap
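
For a sense of how the new snapshot capability looks from code, here is a minimal sketch against the HBase 0.96 Java client; the table and snapshot names are hypothetical, and restoring a snapshot requires the table to be disabled first.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            // Take a point-in-time snapshot of a (hypothetical) table without taking it offline.
            admin.snapshot("web_sessions_snap", TableName.valueOf("web_sessions"));

            // Rolling the table back to the snapshot; the table must be disabled first.
            admin.disableTable("web_sessions");
            admin.restoreSnapshot("web_sessions_snap");
            admin.enableTable("web_sessions");
        } finally {
            admin.close();
        }
    }
}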

Visit our Apache HBase and Microsoft lab pages to learn more.

Why is Hadoop 2 and YARN such a big deal for Syncsort?

This guest blog post is from Syncsort, a Hortonworks Technology Partner and certified on HDP 2.0, by Keith Kohl, Director, Product Management, Syncsort (@keithkohl)

Several years ago, Syncsort set out on a journey to contribute to the Apache Hadoop projects to open up and extend Hadoop, and specifically the MapReduce processing framework. One of the contributions was to open the sort – both the map-side sort and the reduce side – and to make it pluggable. This not only allows other sorts to be inserted, but it also allows sort to be avoided.

With the General Availability of Hadoop 2, pluggable sort is now a reality for all Hadoop 2-based distributions.  With the GA of the Hortonworks Data Platform 2.0 (HDP 2.0), Syncsort is announcing that we are extending our partnership with Hortonworks, including support and certification of HDP 2.0 with YARN.

So what is YARN and why is it important? There is a lot of information on YARN on the Hortonworks web site, but in essence it separates the processing components (for instance MapReduce) from the resource management. It also enables a broader set of use cases for Hadoop and data stored in HDFS beyond MapReduce. But MapReduce is still there, and it sits on top of YARN.

I heard a quote the other day that really made me think about the experiences I hear from our customers and partners: 2013 was the year companies tried to find budget for Hadoop, 2014 is the year they ARE budgeting for Hadoop projects.  But what are people doing with Hadoop and HDP?

ETL is a common use case for Hadoop, even though most people don’t even know they are doing “ETL”. Some call it data refinement or data management. At the end of the day, it’s ETL. If you’re joining data together and then doing some grouping, counting, averaging, etc. – that’s aggregation. Yup: ETL. If you’re processing web logs to understand users’ behavior on your web site: ETL. Some estimate that 40% to 70% of Hadoop use cases today are ETL.

I’m personally excited about our relationship because of what our combined offering can bring to organizations. With our support (and certification), we can now not only bring sort acceleration to HDP 2.0 applications; our DMX-h offering also delivers an easy-to-use graphical interface with a fully functional ETL tool running natively in MapReduce on HDP 2.0 on YARN. And, BTW, that means no code generation. No Java. No HiveQL. No Pig. Yes, ETL.

Hortonworks did something pretty cool by providing users with a VM of a completely installed version of HDP called the Hortonworks Sandbox. A DMX-h test drive is now available in the Hortonworks Sandbox. We also include some sample job templates – like the use cases above – and sample data.

Find out more about the Hortonworks and Syncsort partnership.

Modern Data Architecture Applied with Hadoop

We’re kicking off 2014 with an evolution to our Modern Data Architecture webinar series. Last year we focused on how your existing technologies integrate with Apache Hadoop. This year we will focus on use cases for how Hadoop and your existing technologies are being used to get real value in the enterprise. Join Hortonworks, along with Microsoft, Actian, Splunk and others as we continue our journey on delivering Apache Hadoop as an Enterprise Data Platform.

Missed the previous series? Sign up for the On-Demand Experience with Teradata, Platfora, Informatica, Rackspace, SAS, Tableau, Microstrategy, WANDisco and Revolution Analytics.

Building a Hybrid Modern Data Architecture with Apache Hadoop and Microsoft

Thursday, February 6, 10am PT
Attend this webinar and learn how you can build a hybrid Modern Data Architecture using Apache Hadoop within a Microsoft Windows or Azure (cloud) environment. See how you can seamlessly combine your traditional data warehouse with the new world of Hadoop by leveraging an integrated query model, and how to leverage your existing Microsoft properties such as Windows, System Center, SQL Server, Parallel Data Warehouse and Excel to integrate with Hadoop.

Unlock Big Data Analytics Value for Enterprise With Actian and Hortonworks 

Tuesday, February 18, 10am PT
Join Hortonworks and Actian as we address the challenges faced by companies trying to implement their Big Data strategy. In this webinar, we will identify some of the top challenges around analytics with big data and highlight how companies can use their existing skills to solve them. Additionally, we will provide real-world use cases on the integration between the Hortonworks Data Platform and the ParAccel platform, which aims to simplify the delivery of data services for entire ecosystems of users and significantly lower the total cost of ownership for the modern data platform.

Enrich a 360-degree Customer View with Apache Hadoop and Splunk

Wednesday, February 26, 10am PT
What if your organization could obtain a 360-degree view of the customer across offline, online, social and mobile channels? Attend this webinar with Splunk and Hortonworks and see examples of how marketers, business analysts and operations analysts can reach across disparate data sets in Hadoop to spot new opportunities for up-sell and cross-sell. We’ll also cover examples of how to measure buyer sentiment and changes in buying behavior, along with best practices on how to use data in Hadoop with Splunk to assign customer influence scores that online, call-center and retail branches can use to craft more compelling products and promotions.
Sign Up Now »

Check back here for new webinars in the series and for videos and slides from the past events.

Apache Storm: Real-Time Processing in Hadoop

I recently sat down with Himanshu Bari to discuss how Apache Ambari will serve as the single point of management for Hadoop 2 clusters integrated with Apache Storm and its real-time, streaming event processing.

Himanshu discusses Apache Storm’s five key benefits and how they will add to the power and stability of a Hadoop 2 stack, providing analysis of huge data flows from the second the data is created, and then decades of historical analysis of that data stored in HDFS.

Other highlights include:

  • The reasons for adding Apache Storm to Hortonworks Data Platform
  • How Apache Hadoop YARN opened the door for integration of Storm into Hadoop
  • Two general use case patterns with Storm and specific uses in transportation and advertising
  • How Ambari will provide a single view and common operational platform for enterprises to distribute cluster resources across different workloads

Visit our Apache Storm labs page or our Apache Ambari project page to learn more.

Do You Hadoop? Join us for a Hackathon on Feb 8th/9th

On Feb 8th and 9th, Hortonworks, Microsoft and Elastacloud will be hosting a hackathon at the Microsoft Campus in Mountain View, CA. Whether you’re a newbie or ninja, developer or scientist, we’d love to see you there. Register here.

The focus of the hackathon will be city datasets. For instance, we’ll be drawing on datasets from San Francisco that will measure things like:

  • Pedestrian safety: where accidents occur, how they occur and who has caused them.
  • Parking data: where cars have parked in the City and for how long
  • Graffiti data: Where graffiti has occurred in the City

These master datasets can then be used in conjunction with anything from other intracity datasets to other city and country data to determine and prove a hypothesis.  GIS information is available allowing map plots to be built around visualizing the data.

As an example, one potential use of the parking data is to determine the best trade-off between parking-meter occupancy and parking price, or to design a dynamic pricing scheme based on seasonality or time of day.

We’ll be on hand to offer advice and technical expertise across Hadoop, HDP and HDInsight.

Thanks to our friends at Microsoft and Elastacloud for their partnership.

Elastacloud are a Big Data consultancy and ISV based in London, UK. They are experts in Windows Azure, HDInsight, HDP and machine learning, and have architected and developed some of the highest-profile big data projects on Windows Azure across ecommerce and gaming. They can be contacted at info@elastacloud.com.

Extending Apache Accumulo Support in Hadoop with Hortonworks HDP and Sqrrl

Apache Accumulo is gaining momentum in markets such as government, financial services and health care for its enhanced security and performance. Hortonworks has a long history with this technology and has multiple committers to the Accumulo project on staff – at least one of whom literally helped to write the book on Accumulo. This has enabled Hortonworks to provide enterprise support for Accumulo within the Hortonworks Data Platform for some time now. For those interested, more specifics can be found in our support datasheet.

Since many users have very advanced requirements when working with Accumulo, we often work closely with Sqrrl, who have built extensions to Accumulo adding enterprise-grade functionality with wide appeal. Here’s what Ely Kahn, VP of Business Development at Sqrrl, has to say.

Cell-Level Security for Health Care

Regulated industries such as health care are very interested in security features due to HIPAA compliance demands and data requirements that stem from the Affordable Care Act. Health care providers are investigating the solutions that Sqrrl offers because of cell-level security and access control, which allow them to do new things like sharing data while maintaining compliance, and data encryption both at rest and in motion.
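
To show what cell-level security looks like in practice, here is a minimal sketch using the core Accumulo client API: each cell carries a visibility expression, and scans only return cells that the caller’s authorizations satisfy. The instance, ZooKeeper quorum, credentials, table and labels are all hypothetical.

import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.security.ColumnVisibility;

public class CellLevelSecurityExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Accumulo instance, ZooKeeper quorum and credentials.
        Connector conn = new ZooKeeperInstance("hdp", "zk1.example.com:2181")
                .getConnector("clinician_app", new PasswordToken("secret"));

        // Write a cell that only principals holding both "phi" and "cardiology" can read.
        Mutation m = new Mutation("patient-0001");
        m.put("visit", "diagnosis", new ColumnVisibility("phi&cardiology"), "atrial fibrillation");
        BatchWriter writer = conn.createBatchWriter("patient_records", new BatchWriterConfig());
        writer.addMutation(m);
        writer.close();

        // A scan only returns cells whose visibility expression these authorizations satisfy.
        Scanner scanner = conn.createScanner("patient_records",
                new Authorizations("phi", "cardiology"));
        for (Entry<Key, Value> cell : scanner) {
            System.out.println(cell.getKey() + " -> " + cell.getValue());
        }
    }
}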

These same features have captured the interest of the financial and telecommunication sectors as well. Due to some added analytic capabilities in Sqrrl Enterprise, Hortonworks and Sqrrl partnered recently to develop a “Big Data Security Analytics application” for a large telecommunications provider.

Big Data Security Analytics for Telecom

ESG analyst Jon Oltsik qualifies Big Data Security Analytics with the following:

“Security and IT operations tools spit out an avalanche of data like logs, events, packets, flow data, asset data, configuration data, and assortment of other things on a daily basis. Security professionals need to be able to access and analyze this data in real-time in order to mitigate risk, detect incidents, and respond to breaches. These tasks have come to the point where they are difficult to process using on-hand data management tools or traditional (security) data processing applications.”

This Big Data Security Analytics application enables the telecommunications provider to analyze massive amounts of cybersecurity data in real-time.  This is data that would be too burdensome and expensive to analyze with traditional tools.  The joint Sqrrl/Hortonworks solution relies on Sqrrl’s real-time NoSQL capabilities and Hortonworks’ enterprise-grade distribution, Hortonworks Data Platform (HDP).

Beyond Accumulo

Like many open source software solutions (Hadoop included), Accumulo can be challenging for many users to get started with. Sqrrl provides added software that gives would-be Accumulo downloaders the user experience expected of enterprise-level software.

Sqrrl provides critical extensions to Accumulo that include:

  • Secure search, which enables integration of Accumulo’s cell-level security capabilities with a variety of search capabilities, such as:
    • Secure SQL search to enable real-time aggregations of multi-structured data
    • Secure full-text search, using the Lucene syntax to enable keyword search
    • Secure graph search, to enable exploration of how data is connected
  • JSON support, to enable development of document-style data models
  • High concurrency to power applications supporting large numbers of users
  • A policy engine and labeling engine to simplify the application of fine-grained security labels to datasets and to enable both Attribute Based and Role Based Access Controls.

HDP and Sqrrl Enterprise in the Modern Data Architecture

Sqrrl Enterprise is powered by Accumulo and utilizes the Hadoop Distributed File System for storage of multi-structured data. The integration between Sqrrl Enterprise and the Hortonworks Data Platform is seamless and enables organizations to take advantage of the advanced features available in Accumulo. Here is how Sqrrl Enterprise fits in the Modern Data Architecture with HDP:

For a more detailed Sqrrl/Hortonworks reference architecture, you can download it here.

Using HDP for Hadoop Platform-as-a-Service

Xplenty is a Hortonworks Technology Partner offering Hadoop as a service. We invited Yaniv Mor, Co-founder and CEO of Xplenty to be our guest blogger today sharing his views on HDP and Hadoop. 

We founded Xplenty to make Apache Hadoop easier. A lot easier. We believe Hadoop’s big data revolution should be available for companies of all sizes and intuitive for everyone to use. Whether it’s designing dataflows, setting up clusters, or managing and monitoring them, our platform as a service makes it happen, code-free.

Hadoop should be easier for us too here at Xplenty. Installing, maintaining, and upgrading Hadoop are no easy feats. We have skilled ninjas to tame the elephant and we want the best platform and tools to help them. That’s why we use the Hortonworks Data Platform.

We really love that HDP is 100% open source. Since our product relies on a Hadoop distribution, we need to know that it meets our performance and security standards. With HDP, we can sift through the code of all the components and make sure they are reliable for us and our clients, which they are. And in case we need help, we are not alone: Hortonworks and the open source community are there to support us.

Currently, we are upgrading to Hadoop 2, aka YARN. Hadoop YARN brings many enhancements: higher scalability, better resource management, NameNode high availability, data snapshots, and more. YARN also opens up Hadoop for new applications other than MapReduce, such as Tez for interactive big data querying, Storm for streaming data, and Giraph for graph analysis.

Additionally, our platform uses various other projects that are bundled in HDP 2.0, for example, Pig. The latest version, 0.12, comes with great new features such as the ASSERT operator for data validation, rewritten Avro related functions for serialization, and the beloved IN operator. To add these features we need to integrate the new Pig version in our platform as smoothly as possible.

Upgrading Xplenty with HDP makes life easier. HDP 2.0 is enterprise ready and includes the latest releases of Hadoop, Pig, Oozie, and more. Hortonworks tests that the releases work well together and applies patches where necessary. If we had to upgrade and test everything ourselves it would take a lot more time and cause plenty of frustration for our ninjas. With the help of HDP 2.0, our clients will enjoy Hadoop YARN seamlessly.

Xplenty puts Big Data within reach for companies of all sizes. The innovative Xplenty Hadoop-as-a-Service platform is an easy-to-use cloud service that takes the complexity out of Hadoop, so you can get started right away.
 

Deploying a Hadoop Cluster on Amazon EC2 with HDP2

In this post, we’ll walk through the process of deploying an Apache Hadoop 2 cluster on the EC2 cloud service offered by Amazon Web Services (AWS), using Hortonworks Data Platform.

Both EC2 and HDP offer many knobs and buttons to cater to your specific performance, security, cost, data size, data protection and other requirements. I will not discuss most of these options in this blog, as the goal is to walk through one particular path of deployment to get started.

Let’s go!

Prerequisites

  • Amazon Web Services account with the ability to launch 7 large instances of EC2 nodes.
  • A Mac or a Linux machine. You could also use Windows but you will have to install additional software such as SSH clients and SCP clients, etc.
  • Lastly, we assume that you have basic familiarity with EC2 to the extent that you have created EC2 instances and SSH’d in.

Step 1: Creating a Base AMI with all the OS level configuration common to all nodes

Navigate to your EC2 console from the AWS Dashboard and then click on ‘Launch Instance’.

Let’s select the RHEL 64-bit image and go to the next step, then select a large instance with adequate processing power and memory. Here we also adjust storage as required.

We are ready for Review and Launch. But before you launch the instance, make sure you have downloaded the private key. Keep the private key safe and Launch.

Everything looks good. Let’s view the instances. Now that we have the instance up and running, we will need its public DNS name to connect to it. Let’s SSH in and prep the instance.

That was all the prep we need, so we are going to create a private AMI. Go to the EC2 console, select the instance and, from the Actions menu, select “Create Image”. Make sure you check ‘No reboot’ before you click Create Image, as we would like to continue working on this instance. Wait for the creation of the AMI to complete.

Let’s configure this instance for password-less SSH to all the other nodes in the cluster. The first step is to have the private key on this instance. We will need to move the private key to the .ssh folder and rename it to id_rsa.

Let’s provision the other nodes now. Select the size of the node instances. I will select 6 more nodes here, with 3 nodes dedicated to all the management daemons and 4 nodes dedicated to data nodes. Then click on ‘Review and Launch’, followed by the ‘Launch’ button. Ensure you are using the same key as before so that passwordless SSH works between the Ambari node and the rest of the new nodes, then click on ‘Launch Instance’.

As the instances are getting launched, we will copy down to a text file the Private DNS names of all the instances we have launched so far. We will end up with a list of private DNS names that we will use later when deploying the cluster with Ambari.

Step 2: Customize the security groups to minimize attack surface area while not blocking essential communication channels

We have to add rules to the security groups which were created by default when we launched the instances.

The first security group should have been created when we launched the first instance. We are running the Ambari server on this instance, so we have to ensure we can get to it and that it can communicate with the rest of the instances that we launched later.

Then we also need to open up the ports for IPs internal to the datacenter.

Step 3: Setting up Ambari

Get the bits of HDP and add them to the repo. Next we will refresh the repo.

Then we will install the Ambari server, agreeing to download the bits and the key.

The Ambari Server bits are now installed, so we will configure them. Just accept the default options for all the prompts by pressing Enter.

Let’s start the Ambari Server.

That’s it; we are all set to use Ambari to bring up the cluster.

Step 4: Using Ambari to deploy the cluster

Copy the public DNS name of the Ambari node.

Navigate to port 8080 of the public DNS from your browser. You should see the login page of Ambari. The default username and password are ‘admin’ and ‘admin’ respectively.

This is where we start creating the cluster. Enter any cluster name of your choosing. We are going to create an HDP 2.0 cluster.

Remember the list of private DNS names that you had copied down to a text file. We will pull out the list and paste it into the Target Hosts input box. We will also upload the private key that we have been using on this page.

We are all set to go. The host checks should all come back as green with no warnings.

At this stage, we need to decide what services we need. For this demonstration, I will select everything, although in real life you want to be more judicious and select the bare minimum needed for your requirements.

After we are done selecting the services, it’s time to determine where they will run. Ambari is smart enough to make reasonable suggestions, but if you have a specific topology in mind you might want to move these around.

The next step is to configure which nodes you want the DataNodes and Clients to be. I like to have clients on multiple instances just for convenience.

In the next step we will have to configure the credentials for some of the services. The ones where you need to populate credentials are marked with a number on a red background.

Once we are done with all the inputs, we are ready to review and then start the deployment. At this point it will take a while (~30 minutes) to complete the deployment and test the services.

Voila!! We now have a fully functional and tested cluster on EC2. Happy Hadooping!!!

@saptak

Apache Ambari: Provision, Manage and Monitor Hadoop

I recently sat down with Mahadev Konar and Jeff Sposetti to discuss Apache Ambari v1.4.1. Ambari 1.4.1 is a single framework to provision, manage and monitor clusters based on the Hadoop 2 stack, with YARN and NameNode HA on HDFS.

Mahadev is one of the original architects of Apache Hadoop, a co-founder of Hortonworks, and a committer on Apache Ambari and Apache ZooKeeper. Jeff is the Hortonworks product manager focused on Apache Ambari and Apache Falcon.

Together, Mahadev and Jeff explain how Ambari works, innovations included in version 1.4.1, and the future of the Ambari roadmap.

Other highlights include:

  • Recollections of the challenges managing a Hadoop cluster in the early days, before Apache Ambari existed
  • The two core use cases for Ambari: unification of all Hadoop ecosystem projects under a single point of control & a single point of integration for other software vendors developing applications to run in Apache Hadoop
  • The relationship between Ambari & Apache Hadoop YARN
  • How Ambari allows users to manage NameNode HA for cluster stability
  • Future plans for extensibility for new services and apps running in Apache Hadoop YARN and integrated into Ambari
  • Planned improvements to the Ambari user experience
  • How Ambari provides insight into historical cluster performance & operations

Visit our Apache Ambari project page to learn more.

Extending Apache Ambari and Hadoop data to the Teradata Ecosystem

This guest post is from Steve Ratay, Viewpoint Architect, Teradata Corporation.

Teradata’s Unified Data Architecture is a powerful combination of the Teradata Enterprise Data Warehouse, the Aster Discovery Platform, Apache Hadoop (via the Hortonworks Data Platform) and Teradata Enterprise Management tools in a single architecture. 

If you are a Teradata user managing an Enterprise Data Warehouse or Data Discovery platform, chances are that you are using Teradata Viewpoint, a monitoring and management platform for Teradata systems. In order to complete Viewpoint’s monitoring of the different systems in Teradata’s Unified Data Architecture, Viewpoint 14.10 includes support for monitoring multiple Hadoop clusters running in this architecture.

Challenges with Monitoring Hadoop

In an enterprise scenario, the biggest technical challenge in monitoring and managing Hadoop clusters via Teradata Viewpoint lies in collecting the necessary metrics on Hadoop in a reliable and continuous fashion. Different components of Hadoop expose their data in a variety of ways, involving different tools like Ganglia (for metric collection), Nagios (for alerting), JMX and other interfaces that are not familiar to enterprise customers. Following are the primary issues with using these components for Hadoop monitoring:

  1. Unfamiliarity with the Hadoop management tools, need for training which involves a learning curve and ramp-up period for existing teams responsible for monitoring and management of infrastructure.
  2. Lack of integration with existing enterprise tools. Complexity in handling and understanding multiple user interfaces for diverse systems in the environment.
  3. Parsing the data from each tool and being able to locate and connect to these tools on each Hadoop node is not straightforward.
  4. Significant development time involved to unify different data formats exposed by different technologies on the Hadoop side.
  5. Challenges in locating and communicating with ALL the nodes to obtain the monitoring data. For example, to collect the data from the NameNode and job tracker, the location of these services needs to be configured. Connectivity to different nodes poses security issues as well.
  6. Additional backup, restore and archiving strategies need to be in place to account for Hadoop systems.
  7. Lack of knowledge from our Teradata users about these new management tools.
  8. Parsing the data from each different interface and being able to locate and connect to these interfaces on each Hadoop node is not easy.  

Each of these technologies exposes their data in a different format, and it would take significant development time to properly parse the data from each source.  There’s also a challenge in locating and communicating with the nodes to obtain this data.  Just to collect data from the namenode and jobtracker, the location of these services would have to be configured or discovered, and then failover would have to be accounted for as well.  Expanding the monitoring solution beyond that to collect data from every node poses both connectivity and security issues as well.  Surely there must be a better way!

Monitoring Solutions with Apache Ambari and Viewpoint

Luckily Apache Ambari addresses all of these challenges and concerns by providing a collection of RESTful APIs from which a plethora of Hadoop monitoring data can be obtained, translated into easy-to-understand metrics and presented in a fashion that is already familiar to our users. There is no learning curve or additional training needed.  Ambari handles the work of collecting the monitoring data from a variety of the monitoring technologies mentioned above.  It then aggregates this data and provides a series of RESTful APIs.  These APIs can all be accessed by making web service calls against a central node in the Hadoop cluster.  All data is provided in JSON format so it can easily be parsed by just about any programming language.
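
For example, a collector only needs an HTTP client and a JSON parser to pull service state out of Ambari. The sketch below fetches the HDFS service resource from Ambari’s REST API; the Ambari host, cluster name and credentials are assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.bind.DatatypeConverter;

public class AmbariServicePoll {
    public static void main(String[] args) throws Exception {
        // Hypothetical Ambari host and cluster name; 8080 is Ambari's default port.
        URL url = new URL("http://ambari.example.com:8080/api/v1/clusters/telco/services/HDFS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Ambari's REST API uses HTTP basic authentication.
        String credentials = DatatypeConverter.printBase64Binary("admin:admin".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + credentials);
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // JSON payload with service state and metrics
            }
        } finally {
            conn.disconnect();
        }
    }
}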

Teradata Viewpoint follows standard data collection practices to collect data from Apache Ambari and stores it in the Viewpoint database, which is scheduled for regular backups. The data is collected from Ambari every minute by default, so the database has a view of the state of the Hadoop system over the course of an hour, day or week.  This historical data is used to generate a variety of charts in the Viewpoint web portal, and also powers Rewind, which lets users go back and see exactly what was occurring on the Hadoop cluster at a specific point in time. This enables highly efficient troubleshooting when issues occur and helps predict the future load on a particular system.

By leveraging Apache Ambari’s capabilities, Teradata Viewpoint delivers a comprehensive monitoring solution for ALL the systems in your enterprise data architecture, including multiple Hadoop clusters.  Teradata’s Java and web developers are focused on the tasks at which they excel: getting the data from the source system (Ambari), transforming it into easy-to-understand metrics, and displaying it in Viewpoint’s “Portlets”.  No time was wasted trying to get up to speed on Ganglia, JMX, or many of the details of Hadoop’s inner workings.  Ambari was a critical piece of technology in helping Viewpoint roll out this solution, and it enhances Teradata’s Unified Data Architecture.

Teradata Viewpoint offers the following benefits to enterprise users:

  • Single Pane of Glass – Monitor and manage ALL systems in your architecture, including multiple Hadoop clusters, with no need for separate monitoring and management systems.
  • Completely Customizable User Interface – The Viewpoint UI can be configured with multiple portlets or dashboards to get a snapshot of the health and capacity of ALL systems at once, or to view deeper metrics on a single system.
  • Integration into Existing Tools – No additional setup or installation of packages is needed. Viewpoint leverages the built-in capabilities of Apache Ambari to monitor Hadoop alongside existing Teradata systems, delivering more ROI from existing enterprise tools.
  • Scalable System – Teradata Viewpoint is designed to scale with the customer’s data architecture, and brings years of maturity in monitoring large-scale systems at customer sites.
  • Configurable Alerts – Users can fully configure the thresholds for multiple alert levels. Notifications can be sent via email or any other existing mechanism.
  • Historical Data – Viewpoint lets you rewind metrics to a particular point in time in the past to ease troubleshooting and better predict future capacity needs.
  • Web Browser – Viewpoint is an intuitive, easy-to-use browser-based application that places minimal load on the systems being monitored.

Following are some of the dashboards or portlets, developed specifically for Hadoop.  For information on additional dashboards, please refer to the links at the bottom of this blog.

  • Hadoop Services: Displays summary information for all Hadoop services; typical user is a Hadoop admin or a central DBA.
  • System Health: KPI indicator of system performance and system state for Hadoop and/or other systems; typical user is a Hadoop admin or a central DBA.
  • Alert Viewer: Monitor, configure and manage logged alerts for Hadoop systems; typical user is a Hadoop admin or a central DBA.
  • Node Monitor: Displays summary information about the nodes in a Hadoop system; typical user is a Hadoop admin or a central DBA.
  • Space Usage: Monitoring and managing disk space usage (Perm/Temp/Spool); typical user is a Hadoop admin or a central DBA.
  • Metrics Analysis: View several systems and metrics over a period of time for trending analysis; typical user is a Hadoop admin or a central DBA.
  • Metrics Graph: Graphical representation of system metrics; typical user is a Hadoop admin, a central DBA or a manager.
  • Capacity Heatmap: Interactive visualization tool for analyzing hotspots of various system metrics over user-definable time periods; typical user is a Hadoop admin, a central DBA or a manager.

At the current time, Teradata Viewpoint is only available to Teradata customers that have acquired the Teradata Appliance for Hadoop or the Teradata Aster Big Analytics Appliance.

More Information

The post Extending Apache Ambari and Hadoop data to the Teradata Ecosystem appeared first on Hortonworks.

How to build a Hadoop VM with Ambari and Vagrant

In this post, we will explore how to quickly and easily spin up our own Hadoop VM with Vagrant and Apache Ambari. Vagrant is very popular with developers as it lets you mirror the production environment in a VM while keeping all of your IDEs and tools in the comfort of the host OS.

If you’re just looking to get started with Hadoop in a VM, then you can simply download the Hortonworks Sandbox.

Prerequisites

  • VirtualBox installed on your host machine.
  • Vagrant installed on your host machine.

Spin up a VM with Vagrant

Create a folder for this VM: mkdir hdp_vm

With VirtualBox and Vagrant installed on your system, change into that folder and issue the following command:

vagrant box add hdp_vm https://github.com/2creatives/vagrant-centos/releases/download/v6.5.1/centos65-x86_64-20131205.box

Once the download has completed and the box has been added to your library of VMs under the name hdp_vm, issue the command:

vagrant init hdp_vm

This will create a file ‘Vagrantfile’ in the folder. Open it in a text editor like ‘vi’:

Edit the ‘Vagrantfile’ so that port 8080 on the VM is forwarded to port 8080 on the host:
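
A minimal sketch of the relevant line, added inside the configure block (the rest of the generated Vagrantfile stays as-is):

  # Forward guest port 8080 (Ambari) to port 8080 on the host
  config.vm.network "forwarded_port", guest: 8080, host: 8080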

Let’s also modify the settings so that the VM is assigned adequate memory once it is launched:
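
A minimal sketch of the memory setting using the VirtualBox provider block; 4096 MB is only an example value, but Ambari plus the HDP services generally want at least 4 GB:

  # Give the VM enough RAM to run Ambari and the HDP services
  config.vm.provider "virtualbox" do |vb|
    vb.customize ["modifyvm", :id, "--memory", "4096"]
  end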

We are ready to launch the VM. Once the VM is launched, SSH in, log in as root and change to root’s home directory:
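
The commands for this step would look roughly as follows, run from the hdp_vm folder on the host:

vagrant up       # boot the VM defined by the Vagrantfile
vagrant ssh      # SSH in as the default 'vagrant' user
sudo su -        # become root and land in root's home directory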

Configure the VM

Find out the default hostname of the VM and note it down:
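
One way to do this inside the VM (the -f flag prints the fully qualified name, which is what Ambari will expect later):

hostname -f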

Then we need to edit the ‘/etc/hosts’ file so that it contains an entry for this hostname. Open ‘/etc/hosts’ in ‘vi’ and add the hostname you noted in the previous step, as sketched below.
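
As an illustrative sketch, assuming the hostname recorded in the previous step was "hdp.example.com" (yours will differ), one simple approach is to append it to the loopback entry:

# /etc/hosts before the change:
127.0.0.1   localhost localhost.localdomain

# /etc/hosts after the change:
127.0.0.1   localhost localhost.localdomain hdp.example.com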

Now we will install the NTP service with the following command:

yum install ntp

Next we will install the wget utility with the following command:

yum install wget

Once these are installed, turn on the NTP service with the following commands:

chkconfig ntpd on
service ntpd start

Setting up passwordless SSH

Get a pair of keys: ssh-keygen

The keys will be placed in root’s .ssh folder.

  • Copy the id_rsa file to the /vagrant folder so that you can access the private key from the host machine, since /vagrant is automatically shared between the host and guest OSs.
  • Also append id_rsa.pub, the public key, to the authorized_keys file (both steps are sketched below).
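
A minimal sketch of those two steps, run as root inside the VM:

cp ~/.ssh/id_rsa /vagrant/                         # private key is now visible on the host
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys    # allow passwordless SSH to this VM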

Setup Ambari

Download and copy the Ambari repository bits to /etc/yum.repos.d:

wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.4.3.38/ambari.repo
cp ambari.repo /etc/yum.repos.d

Double check that the repo has been configured correctly:
yum repolist

Now we are ready to install the bits from the repo:
yum install ambari-server

Now we can configure the bits. I just go with the defaults during the configuration:
ambari-server setup

Let’s spin up Ambari:
ambari-server start

Setting up the pseudo-cluster with Ambari

Now you can access Ambari from your host machine at the URL http://localhost:8080. The username and password are admin and admin, respectively.

Then walk through the cluster install wizard:

  1. Name your cluster.
  2. Select HDP 2.0.
  3. Input the hostname of your VM and click on the Choose File button.
  4. Select the private key file, which you can find in the folder you created at the beginning of this post.
  5. Keep the default options for the rest of the steps until you get to Customize Services; in this step, configure your preferred credentials, especially for the components marked with a white number against a red background.
  6. Finish up the wizard.

Voila!!! We have our very own Hadoop VM.

Happy Hadooping!

@saptak

The post How to build a Hadoop VM with Ambari and Vagrant appeared first on Hortonworks.

Hadoop 2 and YARN available on Windows Azure HDInsight Preview

Earlier this week Microsoft announced via their blog that a new version of Windows Azure HDInsight is available in public preview.

Microsoft recognizes the importance of the technical innovation in and around YARN, as well as Hortonworks’ leadership in this area, and we have worked collaboratively to bring Hadoop 2.2 to Azure via our Hortonworks Data Platform 2.0 for Windows release.

Apache Hadoop YARN is the data operating system for Hadoop and greatly expands the range of applications possible on this emerging technology by allowing multiple processing frameworks, such as streaming or graph processing, to plug in natively. It also improves the efficiency of clusters, allowing them to better utilize resources and improve performance.

Performance improvements delivered

In their post, Microsoft describes the substantial performance improvements delivered with this latest release and their collaboration on the Stinger Initiative to bring these improvements to market.

“This release of HDInsight is important because it is engineered on the latest version of Apache Hadoop 2.2 to provide order magnitude (up to 40x) improvements to query response times, data compression (up to 80%) for lower storage requirements, and leveraging the benefits of YARN (upgrading to the future “Data Operating System for Hadoop”).

The 40x improvements to query response times and up to 80% data compression are due to the collaboration between Microsoft, Hortonworks and other community contributors with the Stinger project. Microsoft leveraged the best practices developed in the optimization of SQL Server’s query execution engine to optimize Hadoop. We are pleased to bring enhancements to Hadoop that support such a dramatic performance improvement back to the open source community.”

Deployment choice

Because HDInsight is built on HDP 2.0 for Windows, customers have an unprecedented choice of deployment options for on-premises, cloud or hybrid deployment. They can start in one deployment mode and move seamlessly to another as their requirements evolve.

Shared Vision

Microsoft and Hortonworks have a shared vision of open innovation in and around Apache Hadoop and a commitment to deliver that via a 100% open source platform. This significant new offering enables organizations looking to deploy Hadoop based applications in the cloud to leverage the YARN based architecture of Hadoop 2.0.

This is another representative example of the adage that the vendors you rely on, rely on Hortonworks for Hadoop. Read more about our partnership, and our joint engineering efforts.

The post Hadoop 2 and YARN available on Windows Azure HDInsight Preview appeared first on Hortonworks.
