Archive | Big Data

Cloudera is rebuilding machine learning for Hadoop with Oryx


Hadoop software vendor Cloudera didn’t make a lot of waves when it bought a London-based startup called Myrrix last year, and it hasn’t made a lot of noise about the company’s machine learning technology since then. But the company’s technology and its founder, Sean Owen, could turn out to be very valuable assets.

Owen, whose official title is director of data science, now spends his time working on an open source machine learning project called Oryx. (It’s a species of African antelope; Cloudera also sells a product called Impala.) Oryx is intended to help Hadoop users build machine learning models and then deploy them so they can be queried and serve results in real time, say as part of a spam filter or a recommendation engine. Ideally, Oryx will also support models that can update themselves as data streams in.

Owen calls it the difference between Hadoop’s traditional sweet spot…

View original post (641 more words)


The Rise of Analytics 3.0

An excellent presentation by Thomas H. Davenport, considered one of the most influential data scientists in the world.

The Rise of Analytics 3.0: How to Compete in the Data Economy

In this presentation, he divides analytics into three periods:

  1. Analytics 1.0 -> Traditional
    • Mostly descriptive analysis and reporting
    • Internal, structured data
    • Little contact between analysts and the business side
    • Internal decision support
  2. Analytics 2.0 -> Big Data
    • Predictive and prescriptive analysis
    • Complex, unstructured data from different sources (internal and external)
    • New computational and analytical capabilities (Machine Learning!)
    • Data Scientists
    • Data-based products and services (online companies)
  3. Analytics 3.0 -> Data Economy
    • All decisions are based on or influenced by data
    • Fast delivery of insights
    • Analysis tools are available to those who make the decisions
    • Analysis is embedded in operational and decision processes
    • Every company can create data-based products and services

MIT to offer its first professional MOOC in big data


The Massachusetts Institute of Technology has been involved in online education since the early days, and now it’s taking it a step further. Yesterday, the college announced its first online, professional-leaning Massive Open Online Course (MOOC), entitled “Tackling the Challenge of Big Data.”

Led by a dozen faculty from the university’s Computer Science and Artificial Intelligence Laboratory (CSAIL) at the School of Engineering, the four-week course starts at the beginning of March and is directed specifically at technical professionals and executives — not academic-types. The course is the first in a new set of courses offered by the university called Online X, which offers professional classes through the edX platform.

One important thing, though: these classes may be open, but they don’t come cheap. Participating in the course will run users $495 — far from the free price tags of many MOOCs available. But it’s likely that extra cost…

View original post (144 more words)

Maybe big data is the killer app for Google’s cloud


Google’s Compute Engine cloud doesn’t yet have a Hadoop offering of its own, but the platform is making a name for itself as a viable, if not ideal, place to run big data workloads. The latest validation came on Thursday when Qubole, the Hadoop-as-a-service startup from Hive creators Ashish Thusoo and Joydeep Sen Sarma, announced an option that users can choose to run on Compute Engine, which they claim provides better performance than Amazon Web Services.

Specifically, a company spokesperson told me via email, Qubole has seen 2-3x faster startup times for virtual servers using Compute Engine over Amazon EC2 and more reliable performance from Google Cloud Storage than from Amazon S3. We’ll also assume that AWS is the “CloudX” against which Qubole engineer Praveen Seluka benchmarked Compute Engine, some results of which he shared on the Google Cloud Platform blog. Qubole did launch as an AWS-based service…

View original post (306 more words)

Facebook open sources its SQL-on-Hadoop engine, and the web rejoices

Yet another technology developed inside a company that is not competing with vendors such as Cloudera and Hortonworks. It is interesting to see how much of the innovation has been coming from companies like Facebook, Google, and LinkedIn.


Facebook has open sourced Presto, the interactive SQL-on-Hadoop engine the company first discussed in June. Presto is Facebook’s take on Cloudera’s Impala or Google’s Dremel, and it already has some big-name fans in Dropbox and Airbnb.

Technologically, Presto and other query engines of its ilk can be viewed as faster versions of Hive, the data warehouse framework for Hadoop that Facebook created several years ago. Facebook and many other Hadoop users still rely heavily on Hive for batch-processing jobs such as regular reporting, but there has been a demand for something letting users perform ad hoc, exploratory queries on Hadoop data similar to how they might do them using a massively parallel relational database.

Presto is 10 times faster than Hive for most queries, according to Facebook software engineer Martin Traverso in a blog post detailing today’s news.

Technologically, Hive and Presto are very different, namely because the former…

View original post (399 more words)