Apache: Big Data Europe 2016
Click here to Register or for more information 

Sign up or log in to bookmark your favorites and sync them to your phone or calendar.

Column [clear filter]
Wednesday, November 16

12:00 CET

Apache Kudu: A Distributed, Columnar Data Store for Fast Analytics - Mike Percy, Cloudera
The Hadoop ecosystem has recently made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems like Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems like Apache HBase, applications can achieve millisecond-scale random access to arbitrarily-sized datasets. However, gaps remain when scans and random access are both required.

This talk will investigate the trade-offs between real-time random access and fast analytic performance from the perspective of storage engine internals. It will also describe Apache Kudu, the new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark, that fills the gap described above to provide a new option to achieve fast scans and fast random access from a single API.

avatar for Mike Percy

Mike Percy

Software Engineer, Cloudera
Mike Percy is a software engineer at Cloudera and a PMC member on Apache Kudu, an open source distributed column store for the Hadoop ecosystem. He is also a PMC member on Apache Flume. Prior to joining Cloudera, Mike worked at Yahoo! building machine learning infrastructure for Big... Read More →

Wednesday November 16, 2016 12:00 - 12:50 CET
Nervion/Arenal I

13:00 CET

Parquet Format in Practice & Detail - Uwe L. Korn, Blue Yonder
Apache Parquet is among the most commonly used column-oriented data formats in the big data processing space. It leverages various techniques to store data in a CPU- and I/O-efficient way. Furthermore, it has the capabilities to push-down analytical queries on the data to the I/O layer to avoid the loading of nonrelevant data chunks. With various Java and a C++ implementation, Parquet is also the perfect choice to exchange data between different technology stacks.

As part of this talk, a general introduction to the format and its techniques will be given. Their benefits and some of the inner workings will be explained to give a better understanding how Parquet achieves its performance. At the end, benchmarks comparing the new C++ & Python implementation with other formats will be shown.

avatar for Uwe L. Korn

Uwe L. Korn

Data Scientist, Blue Yonder GmbH
Uwe Korn is a Data Scientist at the German RetailTec company Blue Yonder. His expertise is on building architectures for machine learning services that are scalably usable for multiple customers aiming at high service availability as well as rapid prototyping of solutions to evaluate... Read More →

Wednesday November 16, 2016 13:00 - 13:50 CET
Nervion/Arenal I