Is Hadoop the cloud's killer app?

Posted 12-07-2009

Killer Instinct?

It's named after a stuffed elephant, but the Apache Hadoop project is no toy. It's designed to handle the largest datasets in the world, and it does double duty as both cluster manager and distributed file system. Developers in enterprises around the world have been building such systems from scratch since the dawn of grid computing, and with Hadoop approaching version 1.0, an alternative is at hand.

That's not to say that Hadoop is without warts. Over its three years of formal development, the project has consistently broken backward compatibility, and many users have cited security as an ongoing concern. But Hadoop creator Doug Cutting, who is also a Yahoo employee, says that both of these issues should see solutions in the next two releases.

Mike Fitzgerald, COO of Adknowledge, said that his company has been using Hadoop for almost a year now. His team runs Hadoop in Amazon's EC2 cloud, but it uses its own implementation rather than Amazon's official Hadoop services.

Adknowledge uses its Hadoop cluster to sift through customer data to determine which ads are best suited to which customers. Fitzgerald said that, on average, his team's Hadoop cluster sifts through approximately 40 terabytes of data at a time in a batch job.

Fitzgerald said that developing applications to run on Hadoop requires understanding some new concepts. "You need to understand the concepts of map/reduce, and distributed computing," he said.

"We use Java and have found it relatively easy to write Java code that leverages the Hadoop framework. The most important thing to ask is, 'What are the problems you're trying to solve with Hadoop?' The best times to use it are when you're doing things that require very large-scale computation with a lot of data.

"We have in the past used big iron databases like Netezza, and we have a lot of Oracle. When you reach that scale, you really challenge what those things can handle. You're better off in an environment where you're adding commodity hardware to a cluster."
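For readers new to the map/reduce concept Fitzgerald mentions, the following is a minimal sketch in plain Java of the classic word-count example. It deliberately does not use the Hadoop API: the class name MapReduceSketch and its methods are hypothetical names chosen for illustration, and the whole job runs in a single process, whereas Hadoop would distribute the map and reduce steps across the nodes of a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process imitation of the map/reduce pattern.
// None of these names come from Hadoop itself.
class MapReduceSketch {

    // Map phase: turn one line of input into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: combine every count emitted for a single word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: map each line, group the pairs by key (the "shuffle"),
    // then reduce each group to a single total.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Hadoop scales out", "commodity hardware scales");
        System.out.println(wordCount(lines));
    }
}
```

Because each call to map depends only on its own line, and each call to reduce only on its own key, both phases can in principle be spread over many machines, which is the property that lets a framework add commodity hardware to a cluster as the data grows.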

Hadoop's history
Hadoop began life when Cutting started to build Nutch, an open-source search engine application. He had previously created the Apache Lucene project, which produced an open-source information-retrieval library written in Java. Based on that project, he began working on Nutch around 2004 with Mike Cafarella. Cutting said that a great deal of the work involved in Nutch was creating the underlying cluster infrastructure for physically scaling the platform.

"The only people who could scale to the size of the Web were Google, Microsoft and Yahoo," said Cutting. "Google and Microsoft have similar technology, presumably, that they use internally, but those are special. There's also database technologies, which purport to scale. I don't think they scale as far, or as easily. But they also have different performance comparisons, so it's apples to oranges.

"Hadoop is designed for much more generic data processing. It doesn't require an extensive indexing or data-loading step. It's presenting all of your data ahead of time. All that classic database analysis isn't required."

Cutting eventually found that the infrastructure beneath Nutch was becoming more powerful and elaborate, especially after he read Google's paper on map/reduce. In 2006, he joined Yahoo, and the infrastructure project was officially named after a stuffed elephant: Hadoop. Today, Yahoo houses the world's largest Hadoop cluster, coming in at 4,000 nodes. This cluster contributes to every Yahoo search performed.

With a full team working on Hadoop and its supporting tools and projects, Yahoo and Cutting have pushed the project to version 0.20.0. While there is no set date for the release of version 1.0, the Hadoop team is striving to release it before the end of the year.

Hadoop is made up of a number of subprojects. These include a distributed file system (HDFS), the HBase database, and the Pig language for building data queries. As an Apache Foundation project, however, Hadoop is surrounded by alternative tools. Amazon substitutes its own S3 storage services for HDFS, and Facebook has constructed its own data warehouse infrastructure (Hive) with a SQL-like substitute for Pig.

Ashish Thusoo, engineering manager at Facebook, said his team uses a 600-node Hadoop cluster. He said that Hadoop is useful for business intelligence and summarization applications.

"Our ad insight numbers are generated in Hadoop and Hive. It's a widely published system here, and we get 3,000 jobs a day with more than 100 users using it internally. It's useful for analytics on all sorts of structured data, as well as unstructured data," he said.

Hot property
So compelling is the Hadoop story that Christophe Bisciglia, founder of Cloudera, said that he had to "fend off investors with a stick." Cloudera packages Hadoop into numerous forms for use on the various Linux distributions and within Amazon's EC2. The company also offers numerous training......

Source:
http://www.sdtimes.com/IS_HADOOP_THE_CLOUD_S_KILLER_APP_/By_Alex_Handy/About_APA