Apache: Big Data Europe 2016
Sunday, November 13
 

17:00 CET

Pre-registration Open
Sunday November 13, 2016 17:00 - 19:00 CET
Triana Foyer
 
Monday, November 14
 

07:00 CET

Morning Run
Come meet in the lobby of the Melia Sevilla at 7:00am for a morning run. The plan is to cross to the park and run next to the river. 

This will last an hour and the group will be back by 8:00am.

Monday November 14, 2016 07:00 - 08:00 CET
Melia Sevilla Hotel Lobby

08:30 CET

Breakfast
Monday November 14, 2016 08:30 - 09:30 CET
Giralda Foyer

08:30 CET

Registration
Monday November 14, 2016 08:30 - 17:20 CET
Triana Foyer

09:30 CET

Keynote: Welcome & Opening Remarks - Rich Bowen, Vice President, Conferences, Apache Software Foundation
Speakers

Rich Bowen

Open Source Strategist, AWS
Rich has been doing open source since before we called it that. He's a member and director at the Apache Software Foundation, and has been active on major open source projects including the Apache HTTP Server, Perl, PHP, Wordpress, and OpenStack. He's an Open Source Evangelist at... Read More →


Monday November 14, 2016 09:30 - 09:40 CET
Giralda I/II

09:45 CET

Keynote: Stream Processing as a Foundational Paradigm and Apache Flink's Approach to It - Stephan Ewen, CTO, Data Artisans
Stream Processing is emerging as a popular paradigm for data processing architectures, because it handles the continuous nature of most data and computation and gets rid of artificial boundaries and delays.

The fact that stream processing is gaining rapid adoption is also due to more powerful and maturing technology (much of it open source at the ASF) that has solved many of the hard technical challenges.

We discuss Apache Flink's approach to high performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also give a sneak preview of the next steps for Flink.

Speakers

Stephan Ewen

CTO, Data Artisans
Stephan Ewen is an Apache Flink PMC member and co-founder and CTO of data Artisans. Before founding data Artisans, Stephan led the development of Flink from the early days of the project. Stephan has a PhD in Computer Science from TU Berlin.


Monday November 14, 2016 09:45 - 10:05 CET
Giralda I/II

10:10 CET

Keynote: Training Our Team in the Apache Way - Alan Gates, Co-Founder, Hortonworks
Hortonworks contributes to a number of Apache projects. When we started we depended on our many experienced Apache community members to train their fellow Hortonworkers in the Apache Way. But we grew quickly, and we found this started to break down. So we have instituted training for our teams in what Apache is, how it works, their responsibilities as part of Apache and how that meshes with their responsibilities as Hortonworkers, and a practical list of dos and don’ts. This talk will share some thoughts on the need for this training, give an overview of the content, and review some early results.

Speakers

Alan Gates

Co-founder and Architect, Hortonworks
Alan is a founder of Hortonworks and an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan has done extensive work in Hive, including adding ACID transactions. Alan has a BS in Mathematics from... Read More →


Monday November 14, 2016 10:10 - 10:25 CET
Giralda I/II

10:25 CET

Coffee Break
Monday November 14, 2016 10:25 - 11:00 CET
Giralda Foyer

11:00 CET

Demonstrating the Societal Value of Big & Smart Data Management - Simon Scerri, Fraunhofer IAIS
H2020 BigDataEurope is a flagship project of the European Union's Horizon 2020 framework programme for research and innovation. In this talk we present the Docker-based BigDataEurope platform, which integrates a variety of Big Data processing components such as Hive, Cassandra, Apache Flink and Spark. Particularly supporting the variety dimension of Big Data, it adds a semantic data processing layer, which makes it possible to ingest, map, transform and exploit semantically enriched data. We will present the innovative technical architecture as well as applications of the BigDataEurope platform for life sciences (OpenPhacts), mobility, food & agriculture as well as industrial analytics (predictive maintenance). We demonstrate how societal value can be generated by Big Data analytics, e.g. making transportation networks more efficient or facilitating drug research.

Speakers

Simon Scerri

BDE Deputy Coordinator, Fraunhofer IAIS
Simon Scerri is a senior postdoc in the “Enterprise Information Systems” department at Fraunhofer IAIS and at the University of Bonn. In 2011, Simon received his Ph.D. from the Faculty of Engineering at the National University of Ireland, Galway. Prior to joining Fraunhofer, Simon contributed to research efforts (2005–2013) at the Digital Enterprise... Read More →


Monday November 14, 2016 11:00 - 11:50 CET
Santa Cruz

11:00 CET

The Role of Apache Big Data Stack in Finance: A Real World Experience on Providing Added Value to Online Customers - Luca Rosellini, KEEDIO
Nowadays, the main burden of BigData adoption is clearly the integration of new infrastructure and technologies with legacy systems, especially when dealing with data ingestion.



In this talk, KEEDIO will present the details of the aforementioned BigData architecture, deployed in a hybrid infrastructure for a rising bank in Spain, in order to provide added value to its customers. This success story has been possible by means of custom analytics built on top of several components of the Apache Stack.



The main and most interesting issues of this deployment will be explained, as well as their solutions based on tools like Apache NiFi, Apache Spark, Apache Mesos and Apache Zeppelin. Thus, the complete ingestion architecture will be outlined, as well as data consolidation and processing.



Finally, the low latency online data exploitation architecture will be explained.

Speakers

Luca Rosellini

CTO, KEEDIO
Luca has been working on Big Data projects for major Spanish corporations for the last four years. He now serves as the CTO of KEEDIO, a young Spanish startup focused on solving BigData problems in banking environments. He holds a master's degree in computer engineering at the university... Read More →


Monday November 14, 2016 11:00 - 11:50 CET
Giralda III/IV

11:00 CET

Geospatial Track: Apache SIS for Earth Observation and Beyond - Martin Desruisseaux, Geomatys
Apache SIS is a library that helps developers create their own geospatial applications. SIS closely follows international standards published jointly by the Open Geospatial Consortium (OGC) and the International Organization for Standardization (ISO). In this talk we will show how SIS provides a unified metadata model based on the ISO 19115 standard for summarizing the content of some file formats used for earth observation: GeoTIFF, NetCDF, Landsat 8 and MODIS. We will show how to get the Coordinate Reference System (CRS) from those file formats or from other sources like Well Known Text (WKT) 2 or registries maintained by authorities, and how to use those CRS for coordinate operations. We will present new issues to take into account when applying those tools to extra-terrestrial bodies like Mars or asteroids. Finally we will present the next developments proposed for Apache SIS.

Speakers

Martin Desruisseaux

Developer, Geomatys
I hold a Ph.D. in oceanography, but have continuously developed tools to support analysis work. I used C/C++ before switching to Java in 1997. I have developed geospatial libraries since then, initially as a personal project then as a GeoTools contributor until 2008. I'm now... Read More →


Monday November 14, 2016 11:00 - 11:50 CET
Carmona

11:00 CET

Practical Graph Analytics with Apache Giraph - Roman Shaposhnik, Pivotal
This talk will help you build data mining and machine learning applications using the Apache Giraph framework for graph processing. The talk is based on the book "Practical Graph Analytics with Apache Giraph" and tries to be as hands-on as possible. Apache Giraph offers a simple yet flexible programming model targeted to graph algorithms and designed to scale easily to accommodate massive amounts of data. Originally developed at Yahoo!, Giraph now enjoys a diverse community of contributors from a who's who of Silicon Valley companies: Facebook, LinkedIn and Twitter.

Speakers

Roman Shaposhnik

Director of Open Source, Linux Foundation
Apache Software Foundation and Data, oh but also unikernels


Monday November 14, 2016 11:00 - 11:50 CET
Giralda VI/VII

11:00 CET

Hive 2.0 SQL, Speed, Scale - Alan Gates, Hortonworks
Apache Hive is the most commonly used SQL interface for Hadoop. To meet users' data warehousing needs it must scale to petabytes of data, provide the necessary SQL, and perform in interactive time. The Hive community has produced a 2.0 release of Hive that includes significant improvements. These include:

* LLAP, a daemon layer that enables sub-second response time.

* HBase to store Hive's metadata, resulting in significantly reduced planning time.

* Using Apache Calcite to build a cost based optimizer

* Adding procedural SQL

* Improvements in using Spark as an engine for Hive execution

This talk will cover the use cases these changes enable, the architectural changes being made in Hive as part of building these features, and share performance test results on how these improvements are speeding up Hive.

Speakers

Alan Gates

Co-founder and Architect, Hortonworks
Alan is a founder of Hortonworks and an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan has done extensive work in Hive, including adding ACID transactions. Alan has a BS in Mathematics from... Read More →


Monday November 14, 2016 11:00 - 11:50 CET
Nervion/Arenal II/III

11:00 CET

Putting A Spark in Web Apps - David Fallside, IBM
Web app developers want to take advantage of the sophisticated analytics and big data processing provided by engines such as Apache Spark. Traditionally, web app development would use an enterprise language like Java but with a development emphasis now on agility and simplicity, technologies such as Node.js, Ruby on Rails, and PHP are increasingly being used for such development. Apache Spark has APIs for Scala, Java and Python but no API for Node.js or JavaScript despite their importance for web app development. To fill this gap, the EclairJS open-source project was created to provide an API in Node.js and JavaScript and enable web app developers to incorporate the analytic and other capabilities of Spark. In this presentation, David Fallside will show some web applications that demonstrate Spark's capabilities and explain how they are implemented using EclairJS.

Speakers

David Fallside

Technologist, IBM
David Fallside works in an emerging tech team at IBM that develops the open-source EclairJS project and provides Node.js with an Apache Spark API. Some of the team’s previous projects include LoB tools for Spark and Hadoop, and information engineering for IBM’s Watson... Read More →



Monday November 14, 2016 11:00 - 11:50 CET
Giralda I/II

11:00 CET

Apache Gearpump Next-Gen Streaming Engine - Karol Brejna & Huafeng Wang, Intel
Stream processing is going mainstream in the Big Data world and is becoming widely adopted in the industry. Despite its expanding popularity, many hard problems remain to be solved. Apache Gearpump (incubating) is a next-gen streaming engine designed to solve the hard parts in stream processing. It is good at streaming infinite out-of-order data and guarantees correctness. It helps users easily program streaming applications, get runtime information and update dynamically. In this presentation, we will demystify how Gearpump solves the hard parts in stream processing and achieves high throughput with millisecond-latency message delivery.

Speakers

Karol Brejna

Intel
Father, husband, software enthusiast. After over a dozen years of struggling with system integration, service and event oriented/driven architectures, business process management, enterprise content management, NoSQLs, ESBs, clouds joined Intel to work for Analytics and Artificial... Read More →

Huafeng Wang

Software engineer, Vipshop
Huafeng is a software engineer from Intel's Big Data engineering group, as well as a committer of Apache Gearpump, which is an open sourced streaming process engine initiated by Intel.



Monday November 14, 2016 11:00 - 11:50 CET
Nervion/Arenal I

12:00 CET

Data Science with Spark and Case Study with Non-Motorized Travel Social Data for the Public - Yi Fan Zhang, IBM
The collection, documentation, management and analysis of big data associated with non-motorized travel has not attracted enough attention. This is out of step with the trend of governments strongly advocating cycling, walking and jogging to build low-carbon cities and to improve people's health. This session will share our experience quantifying and characterizing non-motorized travel by means of tempo-spatial analysis. The data used in this case is captured from a well-known online community where amateur runners share their activities. Around 0.5 million running and cycling records from 0.3 million people in Beijing are analyzed with machine learning and data science methodology in this case study. Spark ML with the random forest algorithm, together with grid search for parameter selection, was used for prediction based on weather, AQI and time.

Speakers

Yi Fan Zhang

Software Engineer, IBM
Working in Cloud Data Service, Big data, Entity Analytics Development, IBM China Development Lab. Recently, I have been working on the Smart Traffic with People/Vehicle Trajectory Analysis Platform, including building a Spark distributed computing environment, designing and developing Spark applications... Read More →


Monday November 14, 2016 12:00 - 12:50 CET
Giralda III/IV

12:00 CET

Geospatial Track: Geospatial Big Data: Software Architectures and the Role of APIs in Standardized Environments - Ingo Simonis, Open Geospatial Consortium (OGC)
A number of technologies have evolved around big data, in particular products from the Apache community such as Hadoop, Storm, Spark, Hive, or Cassandra. The geospatial community has developed a range of standards to handle geospatial data in an efficient way. Most of these standards are produced by the Open Geospatial Consortium (OGC) and implemented in the form of domain-agnostic data models and Web services. With the emerging demand for streamlined APIs, new questions emerge how access to Big Data in the geospatial community can be handled most efficiently, how existing standards serve these new demands and implementation realities with distributed Big Data repositories operated e.g. by the various space agencies. This presentation should stimulate the discussion of geospatial Big Data handling in standardized environments and explore the role of products from the Apache community.

Speakers

Ingo Simonis

Director Innovation Programs & Science, OGC
Dr. Ingo Simonis is director of interoperability programs and science at the Open Geospatial Consortium (OGC), an international consortium of more than 525 companies, government agencies, research organizations, and universities participating in a consensus process to develop publicly... Read More →


Monday November 14, 2016 12:00 - 12:50 CET
Carmona

12:00 CET

Graph Processing with Apache Tinkerpop on Apache S2Graph - Doyung Yoon, Kakao Corp.
Since the last conference, the Apache S2Graph community has been working on the integration with Apache Tinkerpop. Tinkerpop users are now able to use S2Graph as a graph database without changing their Tinkerpop code, and also execute OLAP graph queries over their data in HDFS. We will share our experience integrating Tinkerpop as a graph database API, and comment on our current limitations and future plans. We will also present benchmark results comparing S2Graph with existing graph databases such as Neo4j, Titan, and OrientDB. We focus our benchmarks on the "neighbors of neighbors" queries and the basic CRUD operations. Similar to Titan, S2Graph supports multiple storage backends, such as HBase, Cassandra, MySQL, PostgreSQL, and RocksDB, and S2Graph's performance for each backend will be presented.

Speakers

Doyung Yoon

Software Engineer, Kakao
Doyung works in a distributed graph database team at Kakao as a software engineer, where his focus is on performance and usability. He developed Apache S2Graph, an open-source distributed graph database, and has previously presented it at ApacheCon BigData Europe and ApacheCon BigData... Read More →



Monday November 14, 2016 12:00 - 12:50 CET
Giralda VI/VII

12:00 CET

An Overview on Optimization in Apache Hive: Past, Present, Future - Jesús Camacho Rodríguez, Hortonworks
Apache Hive has been continuously evolving to support a broad range of use cases, bringing it beyond its batch processing roots to its current support for interactive queries with LLAP. However, the development of its execution internals is not sufficient to guarantee efficient performance, since poorly optimized queries can create a bottleneck in the system. Hence, each release of Hive has included new features for its optimizer aimed to generate better plans and deliver improvements to query execution. In this talk, we present the development of the optimizer since its initial release. We describe its current state and how Hive leverages the latest Apache Calcite features to generate the most efficient execution plans. We show numbers demonstrating the improvements brought to Hive performance, and we discuss future directions for the next-generation Hive optimizer.

Speakers

Jesús Camacho Rodríguez

Member of Technical Staff, Hortonworks
Jesús Camacho Rodríguez is a Member of Technical Staff at Hortonworks, the PMC chair of Apache Calcite, and a PMC member of Apache Hive. His current work focuses on extending and improving query processing and optimization, ensuring that the increasingly complex workloads supported... Read More →



Monday November 14, 2016 12:00 - 12:50 CET
Nervion/Arenal II/III

12:00 CET

Machine Learning in Apache Zeppelin - Alexander Bezzubov, NF Labs
There are many Machine Learning projects available: Apache Mahout, Apache SystemML (incubating), Apache Spark MLlib, TensorFlow, scikit-learn, etc.

In this session we are going to showcase how a typical ML predictive analytics workflow can benefit from a modern notebook-style interactive environment like Apache Zeppelin. We are going to share examples of successful integration between different projects in the big-data ecosystem and touch upon various techniques like visual recognition, NLP, and Deep Learning.


Speakers

Alexander Bezzubov

Software Engineer, NFLabs
Alexander Bezzubov is an Apache Zeppelin contributor, PMC member and software engineer at NFLabs. Previous speaking experience includes Apache BigData NA 2016 in Vancouver, FOSSASIA 2016 in Singapore, and Apache BigData EU 2015 in Budapest.


Monday November 14, 2016 12:00 - 12:50 CET
Santa Cruz

12:00 CET

Managing Deeply Nested Documents in Apache Solr - Anshum Gupta, IBM Watson
Apache Solr in the recent past started supporting deeply-nested documents. Solr can now be used to perform search and faceting on documents such as nested email threads, comments and replies on social media, enriched and annotated documents etc. without having to flatten them before ingestion.

Anshum Gupta will discuss pre-processing of data so that it can be indexed in Solr, making it possible to perform complex search and statistical aggregation on top of it. He will also cover query formation for sample use cases of nested data and the multiple options and features that Solr provides for faceting or aggregation of such documents.

By the end of this talk, Solr users will have a better understanding of both how to work with the features that Solr provides to find answers to interesting questions from deeply nested documents and the work-arounds for the missing pieces.

Speakers

Anshum Gupta

Sr. Software Engineer, IBM Watson
Anshum Gupta is a Lucene/Solr committer and PMC member with over 10 years of experience with search. He is a part of the search team at IBM Watson, where he works on extending the limits and improving SolrCloud. Prior to this, he was a part of the open source team at Lucidworks and... Read More →


Monday November 14, 2016 12:00 - 12:50 CET
Giralda V

12:00 CET

Building a Scalable Recommendation Engine with Apache Spark, Apache Kafka and Elasticsearch - Nick Pentreath, IBM
There are many resources available for using Apache Spark to build collaborative filtering models. However, there are relatively few for how to build a large-scale, end-to-end recommender system.



This talk will show how to create such a system, using Apache Kafka, Spark Streaming and Elasticsearch for data ingestion, real-time analytics and data storage, Spark DataFrames and ML pipelines for data aggregation and model building, and Elasticsearch for model management, serving and data visualization. We will also explore techniques for scaling model serving, using Spark Streaming for real-time model updates, and how to incorporate state-of-the-art models into this framework.



The talk will be technical and developer-focused, highlighting experiences from building real-world recommender systems, and providing example code (which will be available as open source).

Speakers

Nick Pentreath

Principal Engineer, IBM
Nick Pentreath is a principal engineer in IBM's Center for Open Source Data & AI Technologies (CODAIT), where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations. He has also worked at Goldman Sachs, Cognitive Match... Read More →


Monday November 14, 2016 12:00 - 12:50 CET
Giralda I/II

12:00 CET

Property-based Testing for Spark Streaming - Adrian Riesco, Universidad Complutense de Madrid
Spark Streaming is currently one of the leading frameworks in the industry for distributed stream processing. However, testing Spark Streaming programs is still a challenge, due to the complications of dealing with time. In this presentation, Adrian Riesco gives an introduction to sscheck, a testing library for Spark that extends ScalaCheck with additional temporal logic operators for generators and properties. These operators are used to define tests for Spark Streaming as linear temporal logic formulas, resulting in tests that are high level and easy to understand.

Speakers

Adrian Riesco

PhD Assistant Professor, Facultad de Informatica (UCM)
I currently work as PhD Assistant Professor at Universidad Complutense de Madrid, Spain. I am also a member of the research group FADOSS, and my research interests include formal methods, logic, debugging, and testing.



Monday November 14, 2016 12:00 - 12:50 CET
Nervion/Arenal I

13:00 CET

Uber - Your Realtime Data Pipeline is Arriving Now! - Ankur Bansal, Uber
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.



Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.


Speakers

Ankur Bansal

Sr. Software Engineer, Uber
Ankur Bansal is a senior engineer in Uber's Streaming team. He is currently focused on building Kafka infrastructure and scaling it to keep up with Uber's hypergrowth. His areas of interest include distributed systems and cloud. Before Uber he worked at eBay where he was part of... Read More →



Monday November 14, 2016 13:00 - 13:50 CET
Giralda III/IV

13:00 CET

Geospatial Track: Crowd Learning for Indoor Navigation - Thomas Burgess, indoo.rs GmbH
indoo.rs enables location based services for indoor applications. With indoo.rs, developers can add new features to their products, including triggering events by location, tracking assets, and showing the closest routes to other places. For this, we use WiFi/beacon radio infrastructure, mobile devices and our cloud, which together produce lots of geospatial time series data. The real-time indoor navigation fuses independent movement from custom 9D sensor fusion and position estimates obtained by comparing current signal readings to a reference map. This talk will discuss how we create and maintain these maps in our big data machine learning system, which leverages crowd data through Kafka and Spark to run SLAM and context aware algorithms to create high quality trajectories. In addition to their use in reference maps, these trajectories provide an additional input for our interactive analytics.

Speakers

Thomas Burgess

Director of research, indoo.rs GmbH
Thomas is the CRO of indoo.rs and leads its research efforts since 2012. Earlier, he did his PhD in particle physics at Stockholm University for the AMANDA/IceCube neutrino telescopes, and worked as a postdoctoral researcher at University of Bergen for the ATLAS experiment at the... Read More →


Monday November 14, 2016 13:00 - 13:50 CET
Carmona

13:00 CET

Apache S2Graph (incubating) as a User Event Hub - Hyunsung Jo, Daewon Jeong & Hwansung Yu, Kakao Corp.
S2Graph is a graph database designed to handle transactional graph processing at scale.

Its API allows you to store, manage and query relational information using edge and vertex representations in a fully asynchronous and non-blocking manner.

However, at Kakao Corp., where the project was originally started, we believe that it could be so much more.

There have been efforts to utilize S2Graph as the centerpiece of Kakao's event delivery system taking advantage of its strengths such as

- flexibility of seamless bulk loading, AB testing, and stored procedure features,

- multitenancy that allows interoperability among different services within the company,

- and most of all, the ability to run various operations ranging from basic CRUD to multi-step graph traversal queries in realtime with large volumes.

Speakers

Daewon Jeong

Programmer, kakao
Works on S2Graph team

Hyunsung Jo

Kakao
Seoul-based developer interested in large scale data systems and cloud computing. Currently, working as a data systems developer at Kakao Corp., Korea with open source projects such as Apache S2Graph (incubating) and Druid among others. Previous work experience include software... Read More →

Hwansung Yu

Kakao
Developer interested in LBS services and large scale data systems.



Monday November 14, 2016 13:00 - 13:50 CET
Giralda VI/VII

13:00 CET

Hadoop, Hive, Spark and Object Stores - Steve Loughran, Hortonworks
Cloud deployments of Apache Hadoop are becoming more commonplace. Yet Hadoop and its applications don't integrate that well, something which starts right down at the file IO operations.



This talk looks at how to make use of cloud object stores in Hadoop applications, including Hive and Spark. It will go from the foundational "what's an object store?" to the practical "what should I avoid" and the timely "what's new in Hadoop?", the latter covering the improved S3 support in Hadoop 2.8+.



I'll explore the details of benchmarking and improving object store IO in Hive and Spark, showing what developers can do in order to gain performance improvements in their own code, and equally, what they must avoid.



Finally, I'll look at ongoing work, especially "S3Guard" and what its fast and consistent file metadata operations promise.

Speakers

Steve Loughran

Member of Technical Staff, Hortonworks
Steve Loughran is a developer at Hortonworks, where he works on leading-edge Hadoop applications, most recently on Apache Slider and on Apache Spark's integration with Hadoop and YARN, and Hadoop's S3A connector to Amazon S3. He's the author of Ant in Action, a member of the Apache... Read More →


Monday November 14, 2016 13:00 - 13:50 CET
Nervion/Arenal II/III

13:00 CET

Introducing Apache Apex: Next Gen Big Data Processing on Hadoop - Thomas Weise, DataTorrent
Stream data processing is becoming increasingly important to support business needs for faster time to insight and action with growing volume of information from more sources. Apache Apex (http://apex.apache.org/) is a unified big data in motion processing platform for the Apache Hadoop ecosystem. Apex supports demanding use cases with:



* Architecture for high throughput, low latency and exactly-once processing semantics.

* Rich library of building blocks including connectors for Kafka, Files, Cassandra, HBase and many more

* Java based with unobtrusive API to build real-time and batch applications and implement custom business logic.

* Advanced engine features for auto-scaling, dynamic changes, compute locality.



Apex has been in development since 2012 and is used in production in various industries like online advertising, Internet of Things (IoT) and financial services.

Speakers

Thomas Weise

CTO, Atrato.io
Thomas is Apache Apex PMC Chair and CTO at Atrato. Prior to founding Atrato he was Architect at DataTorrent and led the development of Apex from the beginning of the project. Before that he was a member of the Hadoop Team at Yahoo! and contributed to several of the big data ecosystem... Read More →


Monday November 14, 2016 13:00 - 13:50 CET
Nervion/Arenal I

13:00 CET

Distributed In-Database Machine Learning with Apache MADlib (incubating) - Roman Shaposhnik, Pivotal
Data science is moving with gusto to the enterprise, where data often resides in relational databases with SQL as the main workload. So how can an enterprise add a data science dimension to their business without a major IT re-architecture?

Apache MADlib (incubating) is an innovative SQL-based open source library for scalable in-database analytics. It provides parallel implementations of mathematical, statistical and machine learning methods. Bringing machine learning computations to the data makes for excellent scale out performance on massively parallel processing (MPP) platforms like Greenplum database and Apache HAWQ (incubating).

In this talk, we will describe the origin of MADlib, review the architecture and common usage patterns, and look ahead to some interesting plans around performance acceleration.


Speakers

Roman Shaposhnik

Director of Open Source, Linux Foundation
Apache Software Foundation and Data, oh but also unikernels


Monday November 14, 2016 13:00 - 13:50 CET
Santa Cruz

13:00 CET

Fast & Scalable Email System with Apache Solr - Strategies, Tradeoffs and Optimizations - Arnon Yogev, IBM Research
Email interaction has its unique characteristics and is different than traditional web search (for example in that users search their own private mailboxes and are often interested in recent emails rather than the archive).

Taking advantage of these characteristics, we were able to optimize our infrastructure in terms of indexing strategy and query optimization and achieve a significant gain in scalability and performance.

Arnon will present the various tradeoffs that were explored, including multi-tiered indexes, sorted indexes, query optimizations and more.

Arnon will then present the benchmark results that stress the importance of correctly designing a Solr infrastructure and tailoring it to one's specific use case.

Speakers
avatar for Arnon Yogev

Arnon Yogev

Software Developer & Researcher, IBM Research
Arnon is a software engineer in IBM Research, part of the Social Analytics & Technologies team, Big Data and Cognitive Analytics department. Arnon earned his MBA degree and his B.Sc in Computer Science from the Technion. Being part of the Social Analytics & Technologies team, Arnon's... Read More →


Monday November 14, 2016 13:00 - 13:50 CET
Giralda V

13:00 CET

Building Apache Spark Application Pipelines for the Kubernetes Ecosystem - Michael McCune, Red Hat
Apache Spark based applications are often composed of many separate, interconnected components that are a good match for an orchestrated containerized platform like Kubernetes. But with the increased flexibility afforded by these technologies comes a new set of challenges for building rich data-centric applications.



In this presentation we will discuss techniques for building multi-component Apache Spark based applications that can be easily deployed and managed on a Kubernetes infrastructure. Building on experiences learned while developing and deploying cloud native applications on an OpenShift platform, we will explore common issues that arise during the engineering process and demonstrate workflows for easing the maintenance factors associated with complex installations.

Speakers
avatar for Michael McCune

Michael McCune

Senior Principal Software Engineer, Red Hat
Michael McCune is a software developer creating open source infrastructure and applications for cloud platforms. He has a passion for problem solving and team building, and a lifelong love of music, food, and culture.


Monday November 14, 2016 13:00 - 13:50 CET
Giralda I/II

13:50 CET

15:30 CET

Fighting Identity Theft: Big Data Analytics to the Rescue - Seshika Fernando, WSO2
Identity Theft is no longer just a consumer's problem. Attackers are now targeting Enterprises for bigger financial gains and greater damage not just to the organization's infrastructure but more importantly to its corporate image.



While Enterprise Identity Theft Analytics Tools do exist, most organizations find it economically prohibitive to invest in expensive proprietary software. In this session, Seshika will show how a comprehensive Identity Theft Analytics Solution can be built using Open Source Technologies. She will demonstrate how Big Data Analytics can be used to safeguard any Enterprise by covering the 4 A's of Identity Analytics:
  • Authentication Analytics
  • Authorization Analytics
  • Audit Trail Analytics
  • Adaptive Analytics

Speakers
SF

Seshika Fernando

Seshika is a Senior Technical Lead at WSO2 and focuses on the applications of WSO2’s middleware platform in Financial Markets. Throughout her career, she has had extensive experience in providing technology for Stock Exchanges, Regulators and Investment Banks from across the... Read More →


Monday November 14, 2016 15:30 - 16:20 CET
Giralda III/IV

15:30 CET

Performance Monitoring for the Cloud - Werner Keil, Agile Coach
Performance monitoring tools like Performance Co-Pilot (PCP) have existed almost longer than the World Wide Web. PCP was developed in the early 90s by SGI. Parts were made available as open source from 2000 on, which led to a further spread of the tool. In recent years an active community formed and a variety of new features and enhancements were added. PCP is now part of Red Hat and SuSE Linux Enterprise editions and included in many other Linux distributions. Versions for other Unix variants, OS X and Windows also exist. This session compares popular Open Source Monitoring Tools like Performance Co-Pilot, StatsD, Dropwizard Metrics, Prometheus and Apache Sirona: how they each support Containers or Virtualization, share data with IT monitoring systems like Nagios or Zabbix, or process, analyze and visualize it via Carbon, Graphite or Grafana/Elasticsearch.

Speakers
avatar for Werner Keil

Werner Keil

CATMedia UG & Co. KG
Werner Keil is a Cloud Architect, Jakarta EE and Microservice expert for the public sector. Helping Global 500 Enterprises across industries and leading IT vendors. He worked for over 30 years as IT Manager, PM, Coach, SW architect and consultant for Finance, Mobile, Media, Transport... Read More →


Monday November 14, 2016 15:30 - 16:20 CET
Santa Cruz

15:30 CET

Processing Planetary Sized Datasets - Tim Park, Microsoft
In my group at Microsoft, we have worked with the United Nations, Guide Dogs for the Blind in the UK, several automotive companies, and Ströer on a number of projects involving high scale geospatial data.



In this talk, I'll share some of the best practices and patterns that have come out of those experiences: best practices for storing and indexing geospatial data at scale, incremental ingestion and slice processing of the data, and efficiently building and presenting progressive levels of detail.



The audience will walk away with an understanding of how to efficiently summarize data over a geographic area, general methods for doing ingestion with Apache Kafka (or other event ingestion systems), how to apply incremental updates to large scale datasets with Apache Spark, and best practices around visualizing this data on the frontend.

Speakers
avatar for Tim Park

Tim Park

Software Engineer, Microsoft
Tim is a Principal Software Engineer at Microsoft and works with customers and partners to help them utilize open source platforms on Microsoft’s Azure cloud. He has a particular focus on big data, and, in particular, processing large scale geospatial data. His project experience... Read More →


Monday November 14, 2016 15:30 - 16:20 CET
Carmona

15:30 CET

Moven: Machine/Deep Learning Models Distribution Relying on the Maven Infrastructure - Sergio Fernandez, Redlink GmbH
Modern NLP pipelines use large models that need to be distributed across all the processing infrastructure. For example, in the SSIX project we're managing models of several GBs for the financial sector. At that scale you can't assume the models will be transferred at task submission time, nor manually. From our research, there doesn't appear to be any well-accepted approach to solve this issue (e.g., TensorFlow simply uses git).

Moven (models+maven) is a proof-of-concept that relies on the Maven infrastructure to publish machine/deep learning models. The current implementation allows them to be used from both Java and Python, although we're targeting the more specific needs of some concrete environments, such as Apache Spark or Apache Beam Runners API.

Further details at https://bitbucket.org/ssix-project/moven

Speakers
avatar for Sergio Fernández

Sergio Fernández

Software Engineer, Redlink GmbH
I'm a Software engineer specialized in innovation, with a focus on Data Architectures. My interests include Distributed Architectures, Data Integration, Linked Data and System Engineering. I've worked as software engineer and project manager in different industries, but always somehow... Read More →



Monday November 14, 2016 15:30 - 16:20 CET
Nervion/Arenal II/III

15:30 CET

Large Scale SolrCloud Cluster Management via APIs - Anshum Gupta, IBM Watson
Apache Solr is widely used by organizations to power their search platforms and often supports multiple users. A lot of cluster management APIs were introduced over the last few releases, allowing users to manage operations ranging from replica placement to forcing leader elections via API calls. At the end of this talk, intermediate Solr users will understand what's available and when they can avoid direct interference with the system, leading to more stable clusters and lower chances of nodes going down. The attendees will also be much better equipped to build their own SolrCloud cluster management tools. I will also talk about when not to use these APIs and what's planned in the near future to handle specific operational use cases.

Speakers
avatar for Anshum Gupta

Anshum Gupta

Sr. Software Engineer, IBM Watson
Anshum Gupta is a Lucene/Solr committer and PMC member with over 10 years of experience with search. He is a part of the search team at IBM Watson, where he works on extending the limits and improving SolrCloud. Prior to this, he was a part of the open source team at Lucidworks and... Read More →


Monday November 14, 2016 15:30 - 16:20 CET
Giralda V

15:30 CET

Open Source Operations: Building on Apache Spark with InsightEdge, TensorFlow, Apache Zeppelin, and/or Apache - Samuel Cozannet, Canonical
As software becomes more free and open it is also becoming more complex and expensive to operate. How can we as an Open Source community distill best practices and recommended operations to model complex interconnected services so users can focus on their ideas? How can we as developers deliver recommended best practices in our applications, and when connected to other applications, so users are free to contribute and use the project on their choice of substrate (laptop, cloud, or bare metal [x86, ARM, ppc64el, s390x])?

In this talk we explore how Juju can provide an Open Source method to model a multi-node Apache Spark cluster across a diverse set of substrates, and start adding other services to build additional solutions. This talk will include a demo, and users should be able to take all software shown to try for themselves in a free and Open Source manner.

Speakers
avatar for Samuel Cozannet

Samuel Cozannet

Strategic Program Manager, Canonical | Ubuntu
GPUs, Deep Learning on Kubernetes


Monday November 14, 2016 15:30 - 16:20 CET
Giralda VI/VII

15:30 CET

Scalable Data Science in R and Apache Spark 2.0 - Felix Cheung, Committer
R is a very popular platform for Data Science. Apache Spark is a highly scalable data platform. How could we have the best of both worlds? In this talk we will walk through many examples of how several new features in Apache Spark 2.0.0 will enable this. We will also look at exciting changes coming next in Apache Spark 2.0.1 and 2.1.0.




Speakers
avatar for Felix Cheung

Felix Cheung

Engineering Manager, Uber
Felix started in the big data space about 5 years ago with the then state-of-the-art MapReduce. Since then, he (re-)built Hadoop clusters from metal more times than he would like, created a Hadoop “distro” from two dozen or so projects into .rpm/.deb, and kicked off clusters in... Read More →



Monday November 14, 2016 15:30 - 16:20 CET
Giralda I/II

15:30 CET

Streaming Report: Functional Comparison and Performance Evaluation - Huafeng Wang, Intel
Streaming Report (Mao Wei, Intel) - Stream processing technology has developed rapidly in recent years. With Spark Streaming, Flink, Storm, Heron and Gearpump, many choices are available when people want to pick the right one to solve their real business problems. In this presentation, Mao Wei will go through all of these different frameworks and compare them in detail. From the functional aspect, Wei will discuss the underlying mechanisms of these frameworks and review several function points which users generally care about. From the practical aspect, you will see performance test results based on HiBench, a cross-platform micro-benchmark suite for big data open sourced by Intel BDT. The test cases include identity, repartition, state operation and window operation.

Speakers
avatar for Huafeng Wang

Huafeng Wang

Software engineer, Vipshop
Huafeng is a software engineer from Intel's Big Data engineering group, as well as a committer of Apache Gearpump, which is an open source stream processing engine initiated by Intel.



Monday November 14, 2016 15:30 - 16:20 CET
Nervion/Arenal I

16:30 CET

How Big Data/IoT Leverage the Power of OpenSource to Solve Healthcare Use Cases - Manidipa Mitra, ValueLabs
This session will talk about how a Digital Health Care Management platform can be built (using different open source technologies like Kafka, Spark Streaming, HBase, Hive, PySpark and Mirth) to collect patient data, clinical data (HL7 data), claims data and real-time wearables data, and create a 360 view/insights for a patient's health risk and conditions. It will also talk about how to build a generic platform (by scraping blogs, message boards and articles using an open source tool called Scrapy; ingesting Facebook and Twitter data; storing, analysing and indexing it; building social sentiments; creating word clouds; and segmenting messages using open source technologies like Spark, HBase, Hive, Python and Solr) to find a Key Opinion Leader for a particular disease discussion in social media, and how to provide insights/social sentiments and search capabilities on different medicines used for a particular disease/treatment, to get feedback on medicines or for research purposes.

Speakers
avatar for Manidipa Mitra

Manidipa Mitra

Director, ValueLabs
Manidipa Mitra heads the Big Data CoE at ValueLabs and has extensive experience in building industry-specific solutions using distributed computing and cloud technologies. Having 16+ years of software industry experience and in-depth knowledge on disruptive-technologies, Cloud and... Read More →


Monday November 14, 2016 16:30 - 17:20 CET
Giralda III/IV

16:30 CET

Interactive Analytics at Scale in Apache Hive Using Druid - Jesús Camacho Rodríguez, Hortonworks
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications. However, it does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution allows users to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that benefits both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries.

Speakers
avatar for Jesús Camacho Rodríguez

Jesús Camacho Rodríguez

Member of Technical Staff, Hortonworks
Jesús Camacho Rodríguez is a Member of Technical Staff at Hortonworks, the PMC chair of Apache Calcite, and a PMC member of Apache Hive. His current work focuses on extending and improving query processing and optimization, ensuring that the increasingly complex workloads supported... Read More →



Monday November 14, 2016 16:30 - 17:20 CET
Nervion/Arenal II/III

16:30 CET

Integrators at Work! Real-Life Applications of Apache Big Data Components - Moderated by Phil Archer, W3C
The event will offer both insights and a hands-on opportunity to learn about and try out the Big Data Platform devised by the BigDataEurope (BDE) project. Its most striking feature is the ease with which many Apache big data components like Apache Spark, Flink, Kafka, Cassandra and Solr can be instantiated through a simple UI thanks to the project's use of Docker containers and Docker Swarm.

BDE is producing 7 pilot instances aligned with the 7 H2020 Societal Challenges (SC), each of which targets a real-world use case. Examples include handling the 2GB of data produced each day by a single typical wind turbine; data mining academic journals and matching the named entities with further information about them, including images; and tracking changes in land use and matching them with social and professional media feeds. Many of these use cases depend on another key feature of the BigDataEurope platform: the semantification of big data.

Participants will also have the opportunity to shape the next stage of the BDE platform, based on their unique skills and experiences with the Apache technology.

Moderators
avatar for Phil Archer

Phil Archer

Data Strategist, W3C
Phil Archer is the Data Strategist at W3C, the industry standards body for the World Wide Web, coordinating W3C's work in the Semantic Web and related technologies. He is most closely involved in the Data on the Web Best Practices, Permissions and Obligations Expression and S... Read More →

Speakers
AC

Angelos Charalambidis

Researcher, NCSR “Demokritos”
Angelos Charalambidis is a postdoctoral researcher in the Data Engineering Group of the Institute of Informatics at NCSR “Demokritos”. He received his PhD in Programming Languages and his main interests include declarative programming languages, big data systems optimisations... Read More →
avatar for Hajira Jabeen

Hajira Jabeen

Senior Researcher, University of Bonn
She is a work package lead and coordinator for Big Data Europe. Her research interests are Big Data, Structured Machine Learning, Semantic Web, Data Mining and Evolutionary Computation.
avatar for Axel-Cyrille Ngonga Ngomo

Axel-Cyrille Ngonga Ngomo

Head of Research Group, INFAI
Head of AKSW (http://aksw.org) at University of Leipzig/InfAI, a research group with ca. 50 members. Author of 120+ research papers and 20+ presentations at top-tier conferences. Received manifold research awards including Next Einstein Forum award 2016, 12 best research paper awards... Read More →
SS

Simon Scerri

BDE Deputy Coordinator, Fraunhofer IAIS
Simon Scerri is a senior postdoc in the “Enterprise Information Systems” department at Fraunhofer IAIS and at the University of Bonn. In 2011, Simon received his Ph.D. from the Faculty of Engineering at the National University of Ireland, Galway. Prior to joining Fraunhofer, Simon contributed to research efforts (2005–2013) at the Digital Enterprise... Read More →


Monday November 14, 2016 16:30 - 17:20 CET
Giralda I/II

16:30 CET

SystemML - Declarative Machine Learning - Luciano Resende, IBM
Machine learning in the enterprise is an iterative process. Data scientists will tweak or replace their learning algorithm on a small data sample until they find an approach that works for the business problem and then apply the analytics to the full data set. Apache SystemML is a new system that accelerates this kind of exploratory algorithm development for large-scale machine learning problems. Think of SystemML as SQL for Machine Learning: it provides a high-level language to quickly implement and run algorithms, and it also provides a cost-based optimizer that takes care of low-level decisions about parallelism, allowing users to focus on the algorithm and the real-world problem that the algorithm is trying to solve. This talk will introduce you to SystemML and get you started building declarative analytics with SystemML using Zeppelin notebooks.

Speakers
avatar for Luciano Resende

Luciano Resende

Architect, Spark Technology Center, IBM
Luciano Resende is an Architect in IBM Analytics. He has been contributing to open source at the ASF for over 10 years; he is a member of the ASF and is currently contributing to various big data related Apache projects including Spark, Zeppelin, Bahir. Luciano is the project chair for... Read More →


Monday November 14, 2016 16:30 - 17:20 CET
Santa Cruz

16:30 CET

ETL Pipelines with OODT, Solr and Stuff - Tom Barber, Meteorite Consulting
Discover a number of Apache projects you may not have heard of and how they can help you process both clinical and non-clinical data. Apache OODT, developed by NASA, allows users to ingest and store files and metadata along with process workflows. OODT along with Apache cTAKES allows us to extract clinical information from files, process it, and give end users access to the extracted data.



We can then take these sources and manipulate them further, creating a highly flexible ETL pipeline offering reliability and scalability. Backed by Apache Solr, users can then interrogate the data via web interfaces and instigate further post processing and investigation.



Of course you may not have a clinical use case, but the platforms can be repurposed and will allow you to go away and build your own scalable data pipeline for processing and ingestion.

Speakers
avatar for Tom Barber

Tom Barber

Technical Director, Spicule LTD
Tom Barber is the director of Meteorite BI and Spicule BI. A member of the Apache Software Foundation and regular speaker at ApacheCon, Tom has a passion for simplifying technology. The creator of Saiku Analytics and open source stalwart, when not working for NASA, Tom currently deals... Read More →


Monday November 14, 2016 16:30 - 17:20 CET
Giralda V

16:30 CET

Deep Neural Network Regression at Scale in Spark MLlib - Jeremy Nixon, Spark Technology Center
Jeremy Nixon will focus on the engineering and applications of a new algorithm in MLlib. The presentation will focus on the methods the algorithm uses to automatically generate features to capture nonlinear structure in data, as well as the process by which it's trained. Major aspects of that are the compositional transformations over the data, advantages of the various activation functions, the final linear layer, the cost function and training via backpropagation. Applications will look into how to use neural network regression to model data in computer vision, finance, and the environment. Details around optimal preprocessing, the type of structure that can be found, and managing its ability to generalize will inform developers looking to apply nonlinear modeling tools to problems that they face.

Speakers
avatar for Jeremy Nixon

Jeremy Nixon

Machine Learning Engineer, Spark Technology Center
I'm a Machine Learning Engineer at the Spark Technology Center, focused on scalable deep learning. I contribute to MLlib at the STC, which I joined after graduating from Harvard College concentrating in Applied Mathematics and Computer Science.


Monday November 14, 2016 16:30 - 17:20 CET
Giralda VI/VII

16:30 CET

Myriad, Spark, Cassandra, and Friends - Big Data Powered by Mesos - Jörg Schad, Mesosphere
Processing Big Data necessitates large compute clusters. And large clusters, especially when running multiple Big Data systems, require some kind of cluster manager and cluster scheduler.

In this talk, we will give an overview of how Apache Mesos and DC/OS help solve the problems of large scale clusters and then take a look at the current state of the Big Data ecosystem built on top of this foundation.

We will discuss differences between Apache YARN and Apache Mesos and why, thanks to Apache Myriad, they are not exclusive choices.

Furthermore, we will look at the growing Big Data ecosystem on top of Apache Mesos and DC/OS including, for example, Apache Spark, Apache Cassandra, and Apache Kafka.

Finally, we will also provide some insights into future developments, both for the foundation (i.e., Apache Mesos and DC/OS) as well as the Big Data ecosystem on top.

Speakers
avatar for Jörg Schad

Jörg Schad

CTO, ArangoDB
Jörg Schad is the CTO at ArangoDB. In a previous life, he has worked on or built machine learning pipelines in healthcare, distributed systems, including early Kubernetes code at Mesosphere, and in-memory databases. He received his Ph.D. for research about distributed databases and... Read More →


Monday November 14, 2016 16:30 - 17:20 CET
Carmona

16:30 CET

Real Time Aggregates in Apache Calcite -- Optimal Use of your Streaming Data - Atri Sharma, Microsoft
The talk will focus on how to develop applications in the real-time analytics space using Apache Calcite's advanced query planning capabilities. It will give a short overview of Calcite's planner and rules engine and then discuss the capabilities that can be used to develop real-time applications that continuously stream and process data. It will cover the ongoing work in Calcite's framework and the upcoming streaming aggregation features. It will also look at Calcite's highly adaptable framework, which allows Calcite to work with many existing projects, and at how your current application can take advantage of Calcite's planning and aggregation capabilities.

Speakers
avatar for Atri Sharma

Atri Sharma

SDE-II, Microsoft
A distributed systems engineer, committer on Apache Apex, PMC Member on Apache MADlib, PPMC Member on Apache HAWQ and major contributor in the PostgreSQL Project, having implemented GROUPING SETS, ROLLUP, CUBE and Ordered Set Aggregates


Monday November 14, 2016 16:30 - 17:20 CET
Nervion/Arenal I

17:30 CET

BoF Space Available - Book Now! (Space is Limited)
Are you passionate about a topic and want to share that with others? If so, sign up to lead a Birds of a Feather (BoF) session. Instead of passive listening, all attendees and organizers are encouraged to become participants, with discussion leaders providing moderation and structure for attendees. To sign up for a BoF Session, please book through the form. You will select the time and then be prompted to enter your BoF details.

Monday November 14, 2016 17:30 - 18:30 CET
TBA

17:30 CET

BoF: Apache Beam and You! - Jean-Baptiste Onofré, Talend & ASF and Dan Halperin, Google
Apache Beam is a unified programming model for big data processing. In particular, Beam is built around three visions and kinds of users: while the end users are the pipeline writers, the SDK & DSL writers and the runner writers can be you, as a contributor to other Apache projects!

Speakers
DH

Daniel Halperin

Google
Dan Halperin is a PMC member of Apache Beam. He has worked on Beam and Google Cloud Dataflow for 2 years. Previously, he was the director of research for scalable data analytics at the University of Washington eScience Institute, where he worked on scientific big data problems in... Read More →
JO

Jean-Baptiste Onofré

Talend
JB is a PMC member for Apache Beam. He is a long-tenured Apache member, serving as PMC member/committer for about 15 projects that range from integration to big data.


Monday November 14, 2016 17:30 - 18:30 CET
Nervion/Arenal II/III

17:30 CET

BoF: Apache Branding Policies & Trademarks - Shane Curcuru, Apache Software Foundation
Come meet Shane, VP, Brand for the ASF, to ask your questions about how Apache projects are branded and how you can respect Apache trademarks!
https://s.apache.org/trademarks

Speakers
avatar for Shane Curcuru

Shane Curcuru

Founder, Punderthings Consulting
Shane serves as V.P. of Brand Management for the ASF, setting trademark and brand policy for all 250+ Apache projects, and has served as five-time Director, and member and mentor for Conferences and the Incubator. Shane's Punderthings consultancy is here to help both companies and... Read More →


Monday November 14, 2016 17:30 - 18:30 CET
Nervion/Arenal I

17:30 CET

BoF: Apache Way as a Cultural Template - Dzmitry Pletnikau, Unicity Intl
Organizations operate within a framework of rules, written or implicit. These rules form the "culture". A manager's or executive's job is to shape and guide the evolution of these rules. Can the Apache Way be used as a drop-in cultural template in any organization producing intellectual assets?

Speakers
DP

Dzmitry Pletnikau

Unicity Intl
Growing up in Belarus, Dzmitry developed an early interest in natural sciences and computer programming. After receiving a degree in Physics, Dzmitry chose to freelance as a programmer. Five years later he settled as a Software Architect at Unicity, in Utah. Dzmitry is responsible... Read More →



Monday November 14, 2016 17:30 - 18:30 CET
Santa Cruz

17:30 CET

BoF: Open Source Beyond Software - Alexander Bezzubov, NF Labs
Open source software in general, and the Apache Software Foundation in particular, are a great example of how the principles below have changed the whole industry:
  • Permissive licensing
  • Open governance 
  • Distributed networks of collaborators
  • Work, guided by one's desire 
The same principles are beginning to be applied to other aspects of life by different communities around the globe:
  • Open hardware 
  • Makers 
  • Publishing
  • DIYbio
  • Housing 

As well as some more traditional cultural phenomena similar in spirit:
  • Shanzhai (Chinese: 山寨) 
  • Kibbutz (Hebrew: קִבּוּץ / קיבוץ) 

Let's explore existing initiatives and see where they can lead us together!

Speakers
AB

Alexander Bezzubov

Software Engineer, NFLabs
Alexander Bezzubov is an Apache Zeppelin contributor, PMC member and software engineer at NFLabs. Previous speaking experience includes Apache BigData NA 2016 in Vancouver, FOSSASIA 2016 in Singapore, Apache BigData EU 2015 in Budapest.


Monday November 14, 2016 17:30 - 18:30 CET
Carmona
 
Tuesday, November 15
 

07:00 CET

Morning Run
Come meet in the lobby of the Melia Sevilla at 7:00am for a morning run. The plan is to cross to the park and run next to the river. 

This will last an hour and the group will be back by 8:00am.

Tuesday November 15, 2016 07:00 - 08:00 CET
Melia Sevilla Hotel Lobby

08:30 CET

Breakfast
Tuesday November 15, 2016 08:30 - 09:30 CET
Giralda Foyer

08:30 CET

Sponsor Showcase
Tuesday November 15, 2016 08:30 - 15:30 CET
Triana Foyer

08:30 CET

Registration
Tuesday November 15, 2016 08:30 - 18:00 CET
Triana Foyer

09:30 CET

Keynote: Hadoop Infrastructure @Uber Past, Present and Future - Mayank Bansal, Sr. Engineer, Uber
Uber’s mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. At Uber, Hadoop plays a critical role in the data infrastructure. We want to talk about the journey of Hadoop @Uber and our future plans in terms of scaling for billions of trips. We will talk about the most unique use cases Uber has and how Hadoop and the ecosystem we built around it helped us in this journey. We want to talk about how we scaled from 10 to 1000 nodes and how, in the future, we plan to scale up to tens of thousands of nodes with the help of Apache Mesos, Myriad and Hadoop. We will talk about our mistakes, learnings and wins, and how we process billions of events per day. We will talk about the unique challenges and real-world use cases, and how we will co-locate Uber’s service architecture with batch workloads (e.g., data pipelines, machine learning and analytical workloads).

Speakers
avatar for Mayank Bansal

Mayank Bansal

Staff Engineer, Uber
Mayank Bansal is currently working as a Staff Engineer at Uber on the data infrastructure team. He is co-author of Peloton. He is an Apache Hadoop Committer and Oozie PMC member and Committer. Previously he worked at eBay on the Hadoop platform team, leading the YARN and MapReduce effort. Prior to... Read More →



Tuesday November 15, 2016 09:30 - 09:50 CET
Giralda I/II

09:55 CET

Keynote: The ASF's Big Tent - Sean Owen, Director of Data Science, Cloudera
The ASF is going stronger than ever: more projects, contributors, corporations under an increasingly big tent. While the ASF facilitates software development on its surface, the ASF is more than just a GitHub. Its collaboration model and people drive the success and longevity of projects and the foundation. Together in person we should strengthen community bonds.

Speakers
avatar for Sean Owen

Sean Owen

Director of Data Science, Cloudera
Sean is Director of Data Science, based in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time learning on Hadoop. He is on the Apache Spark PMC and co-author of Advanced Analytics with Spark. Previously, Sean was a senior... Read More →


Tuesday November 15, 2016 09:55 - 10:10 CET
Giralda I/II

10:10 CET

Coffee Break
Tuesday November 15, 2016 10:10 - 11:00 CET
Giralda Foyer

11:00 CET

Real Use Cases of Kappa Architecture - Juantomas Garcia, Open Sistemas
During the last three years we have used the Kappa architecture in almost all our projects. We want to show how the Kappa architecture fit projects of very different sizes. The Kappa architecture is not a silver bullet for every project but is very likely ...

Speakers

Juantomas Garcia

Data Solutions Manager, Open Sistemas
President of Hispalinux (Spanish User Local Group) (1999-2007). Author of the book "La Pastilla Roja", the first book in Spanish about free software (2004). More than 200 lectures around the world. Now CDO of Open Sistemas and advocate of Apache Spark and Kappa Architecture. Organize of... Read More →


Tuesday November 15, 2016 11:00 - 11:50 CET
Nervion/Arenal I

11:00 CET

Building a Robust Analytics Platform with an Open-Source Stack - Dao Mi & Alex Kass, DigitalOcean
As modern enterprises migrate to microservice-centric cloud architecture, it has become imperative to build a new data analysis framework to handle - often in real time - the event-based data these services produce. For this presentation, we will demonstrate how to leverage multiple open source projects to build a robust framework quickly and cheaply that can scale with an organization as it grows and inexorably generates more and more data.

They will cover a tangible, real-world implementation that includes Apache technologies such as Kafka, Mesos, and Spark, as well as open-source PrestoDB (Facebook).

The speakers will discuss lessons learnt during and after the build, as well as some specific use cases showing how this approach brought about otherwise-unattainable actionable business insights and results, including hardware failure prediction and capacity planning.

Speakers

Alex Kass

Engineering Manager, DigitalOcean
Alex Kass has worked at companies ranging from large financial institutions to early-stage startups, regularly building successful analytical models and systems of varying size. At DigitalOcean, a fast-growing global cloud hosting provider, he has at his disposal sufficient software... Read More →

Dao Mi

Data Engineer, DigitalOcean
Dao Mi has extensive experience working with data of different scale, type and velocity across myriad industries, from natural gas to floating bonds. While at Microsoft, he helped deliver BI and predictive analytics solutions to Fortune 500 clients. He has helped build a large custom... Read More →


Tuesday November 15, 2016 11:00 - 11:50 CET
Giralda V

11:00 CET

Crawling the Web for Common Crawl - Sebastian Nagel, Common Crawl
Common Crawl is a non-profit organization which regularly crawls a significant sample of the web and makes the data accessible free of charge to everyone interested in running machine-scale analysis on web data. The presentation will demonstrate how to use the Common Crawl data, covering data formats and tools as well as examples and derived datasets. The monthly crawls are run by Apache Nutch on Apache Hadoop. Sebastian will also share his experience from running a web-scale crawl on a small budget.

Speakers

Sebastian Nagel

Crawl Engineer, commoncrawl.org
Sebastian Nagel works as a crawl engineer at Common Crawl, a non-profit organization that makes web data freely accessible to everyone. Prior to joining Common Crawl he implemented search and data quality solutions at Exorbyte. Sebastian is a committer and PMC member of Apache Nutch, a scalable... Read More →



Tuesday November 15, 2016 11:00 - 11:50 CET
Giralda III/IV

11:00 CET

Apache HBase: Overview and Use Cases - Apekshit Sharma, Cloudera
NoSQL databases are critical in building Big Data applications. Apache HBase, one of the most popular NoSQL databases, is used by Facebook, Apple, eBay and hundreds of other enterprises to store, analyze and profit from their petabyte-scale volume of data. This talk will discuss

- motivation behind NoSQL databases

- basic architecture of a popular NoSQL system, Apache HBase

- some commonly seen big data usage patterns in industry, and when & how to use Apache HBase (or another, better-suited NoSQL database).

Speakers

Apekshit Sharma

Software Engineer, Cloudera Inc
Apekshit Sharma (Appy) is a Software Engineer at Cloudera and a contributor to Apache HBase. Previously, he was at Google building backend infrastructure using MapReduce, Bigtable & MillWheel. He earned his B.Tech in Computer Science from the Indian Institute of Technology, Bombay. Currently... Read More →


Tuesday November 15, 2016 11:00 - 11:50 CET
Carmona

11:00 CET

Native and Distributed Machine Learning with Apache Mahout - Suneel Marthi, Red Hat
Data scientists love tools like R and scikit-learn since they are declarative and offer convenient and intuitive syntax for analysis tasks, but these tools are limited by local memory. Mahout offers similar features with near-seamless distributed execution.

In this talk, we will look at Mahout-Samsara's distributed linear algebra capabilities and demonstrate the same by building a classification algorithm for the popular 'Eigenfaces' problem using the Samsara DSL from an Apache Zeppelin notebook. We will demonstrate how a simple classification algorithm may be prototyped and executed, and show the performance using Samsara DSL with GPU acceleration. This will demonstrate how ML algorithms built with Samsara DSL are automatically parallelized and optimized to execute on Apache Flink and Apache Spark without the developer having to deal with the underlying semantics of the execution engine.

Speakers

Suneel Marthi

AWS
Suneel is a Member of the Apache Software Foundation and is a Committer and PMC member on Apache Mahout, Apache OpenNLP and Apache Streams. He has presented in the past at Flink Forward, Hadoop Summit, Berlin Buzzwords, Machine Learning Conference, Big Data Tech Warsaw and Apache Big Data.


Tuesday November 15, 2016 11:00 - 11:50 CET
Santa Cruz

11:00 CET

Smart Manufacturing with Apache Spark Streaming and Deep Learning - Prajod Vettiyattil, Wipro Technologies
Even a century after the Industrial Revolution, manufacturing processes, even within assembly lines, involve manual steps requiring costly human intervention, e.g. product quality inspection. With the advent of machine learning and big data tools, it has become possible to automate many of these manual processes. What is more, such solutions can surpass human capability for manual quality inspection. In this session we will look at a few examples of how products on assembly lines can be monitored for quality, using image processing techniques combined with machine learning. The solution to be presented is built using a combination of machine learning and deep learning techniques running on Apache Spark Streaming.

The presentation will also explain the steps involved in creating such a solution: mapping a business need to an ML-based technical solution.

Speakers

Prajod Vettiyattil

Architect, Wipro
Prajod is a Senior Architect in the open source solutions group of Wipro Technologies, responsible for research and solution development in the area of Big Data and Analytics. His current work involves analyzing image and video content using machine learning, to solve hard problems... Read More →


Tuesday November 15, 2016 11:00 - 11:50 CET
Giralda VI/VII

11:00 CET

Introducing Apache CouchDB 2.0 - Jan Lehnardt, Neighbourhoodie Software
A thorough introduction to CouchDB 2.0, the five-years-in-the-making final delivery of the larger CouchDB vision.



Apache CouchDB 2.0 finally puts the C back in C.O.U.C.H.D.B: Cluster of unreliable commodity hardware. With a production-proofed implementation of the Amazon Dynamo paper, CouchDB now has high-availability, multi-machine clustering as well as scaling options built in, making it ready for Big Data solutions that benefit from CouchDB's unique multi-master replication.


Speakers

Jan Lehnardt

CEO, Neighbourhoodie Software
Jan Lehnardt is the PMC Chair and VP of Apache CouchDB, co-creator of the Hoodie web app framework based on CouchDB as well as the founder and CEO of Neighbourhoodie Software. He’s the longest standing contributor to Apache CouchDB.


Tuesday November 15, 2016 11:00 - 11:50 CET
Nervion/Arenal II/III

11:00 CET

A Java Implementer's Guide to Boosting Apache Spark Performance - Tim Ellison, IBM
Apache Spark has rocked the big data landscape, becoming the largest open source big data community with over 750 contributors from more than 200 organizations. Spark's core tenets of speed, ease of use, and its unified programming model fit neatly with the high performance, scalable, and manageable characteristics of modern Java runtimes. In this talk Tim Ellison, a JVM developer at IBM, shows some of the unique Java 8 capabilities in the JIT compiler, fast networking, serialization techniques, and GPU off-loading that deliver the ultimate big data platform for solving business problems. Tim will demonstrate how solutions, previously infeasible with regular Java programming, become possible with this high performance Spark core runtime, enabling you to solve problems smarter and faster.

Speakers

Tim Ellison

Tim Ellison is currently a Senior Technical Staff Member with IBM's Java Technology Centre in the UK. He has worldwide responsibility for Open Source Engineering in the Java SDK underpinning a broad selection of IBM's flagship products. He is a Member of the Apache Software Foundation... Read More →


Tuesday November 15, 2016 11:00 - 11:50 CET
Giralda I/II

12:00 CET

Apache Ignite - Path to Converged Data Platform - Dmitriy Setrakyan, GridGain
Apache Ignite is one of the fastest growing Apache projects. The presentation will take the audience on a roadmap discovery of Ignite moving to a converged storage model, supporting both analytical and transactional data sets. We will go over the differences between Fast Data and Big Data and cover the projects supporting both technologies. We will discuss the reasons, real-life use cases and technology approaches for merging Fast Data and Big Data in order to deliver a consistent & universal data processing platform regardless of where data resides relative to HDD, flash or DRAM.

Speakers

Dmitriy Setrakyan

EVP Engineering, GridGain
Dmitriy Setrakyan is founder and Chief Product Officer at GridGain. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior... Read More →


Tuesday November 15, 2016 12:00 - 12:50 CET
Giralda V

12:00 CET

Apache Spark: Enterprise Security for Production Deployments - Vinay Shukla, Hortonworks
Spark is being deployed in production by many enterprises. With enterprise traction comes enterprise security requirements and the need to meet enterprise security standards.



The session walks through enterprise security requirements, provides a deep dive into Spark security features and shows how Spark meets these enterprise security requirements.



The talk goes on to cover the entire gamut of security in Spark, from Kerberos, authentication, authorization and audit to encryption. The session will provide a deep dive into all existing security features in Spark and will also outline future security work planned in the Apache Spark community.

Speakers

Vinay Shukla

Director, Product Management, Hortonworks
Vinay Shukla is the Director of Product Management for Spark & Zeppelin at Hortonworks. Previously, Vinay worked as a Developer and Security Architect. Vinay has given talks at Hadoop Summit (2x), Apache Con Big Data - Europe (2015), JavaOne & Oracleworld. His most recent talk was... Read More →


Tuesday November 15, 2016 12:00 - 12:50 CET
Nervion/Arenal I

12:00 CET

Create a Hadoop Cluster and Migrate 39PB Data Plus 150000 Jobs/Day - Stuart Pook, Criteo
Criteo had a Hadoop cluster with 39 PB of raw storage, 13404 CPUs, 105 TB of RAM, 40 TB of data imported per day and >100000 jobs per day. This cluster was critical for both storage and compute but had no backups. This talk describes:
0/ the different options considered when deciding how to protect our data and compute capacity
1/ the criteria established for the 800 new computers and comparison tests between suppliers' hardware
2/ the non-blocking network infrastructure with 10 Gb/s endpoints scalable to 5000 machines
3/ the installation and configuration, using Chef, of a cluster on new hardware
4/ the problems encountered in moving our jobs and data from the old CDH4 cluster to the new CDH5 cluster 600 km distant
5/ running the two clusters in parallel and feeding them with data
6/ fail over plans
7/ operational issues
8/ the performance of the 16800 core, 200 TB RAM and 60 PB disk CDH5 cluster.

Speakers

Stuart Pook

Senior DevOps Engineer, Criteo
Stuart loves storage (130 PB at Criteo) and is part of Criteo's Lake team that runs some small and two rather large Hadoop clusters. He also loves automation with Chef because configuring more than 2200 Hadoop nodes by hand is just too slow. Before discovering Hadoop he developed... Read More →



Tuesday November 15, 2016 12:00 - 12:50 CET
Giralda VI/VII

12:00 CET

The Original Vision of Nutch, 14 Years Later: Building an Open Source Search Engine - Sylvain Zimmer, Common Search
Few people remember that before spinning off Hadoop and focusing on crawling, Nutch was meant to be an alternative to commercial search engines. What if we tried to do it again today?



In this presentation, Sylvain Zimmer will explain how he used projects from the Nutch diaspora like Spark and Elasticsearch to build Common Search, an open source search engine with transparent rankings.



We will go over the architecture of large-scale search engines and how it has evolved since the late 90s. Then we will review the tools from the Apache and open source ecosystems that are best suited to solve the many challenges at hand. Finally, we will discuss what lies ahead for Common Search before it can be useful to the general public.

Speakers

Sylvain Zimmer

Founder, Common Search
Sylvain Zimmer is a software developer and longtime free culture advocate. In 2004 he founded Jamendo, the largest Creative Commons music community online. Since 2012, he has been the CTO of Pricing Assistant, a startup specialized in large-scale crawling of E-commerce websites. He... Read More →


Tuesday November 15, 2016 12:00 - 12:50 CET
Giralda III/IV

12:00 CET

Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Other NoSQL Data Systems - Christian Tzolov, Pivotal
When working with BigData & IoT systems we often feel the need for a Common Query Language. System-specific languages usually require longer adoption time and are harder to integrate within existing stacks.

To fill this gap some NoSQL vendors are building SQL access to their systems. Building a SQL engine from scratch is a daunting job, and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allows you to integrate a SQL parser, a cost-based optimizer, and JDBC with your NoSQL system.

We will walk through the process of building a SQL access layer for Apache Geode (In-Memory Data Grid). I will share my experience, pitfalls and technical considerations such as balancing between the SQL/RDBMS semantics and the design choices and limitations of the data system.

Hopefully this will enable you to add SQL capabilities to your preferred NoSQL data system.

Speakers

Christian Tzolov

Pivotal Inc
Christian Tzolov, Pivotal technical architect, BigData and Hadoop specialist, contributing to various open source projects. In addition to being an Apache® Committer and Apache Crunch PMC Member, he has spent over a decade working with various Java and Spring projects and has led... Read More →



Tuesday November 15, 2016 12:00 - 12:50 CET
Nervion/Arenal II/III

12:00 CET

Using Apache Spark for Generating ElasticSearch Indices Offline - Andrej Babolcai, ESET
Making historical data available for searching can be a challenge, especially if you have a lot of it. Indexing data to a live cluster can degrade search performance, and having a spare cluster where you index your data can be expensive. In this talk we present the approaches we tried and describe an approach to create ElasticSearch indices offline using Apache Spark. Once created, these indices are stored as snapshots in HDFS and can then be restored to a running ElasticSearch cluster. Snapshots in HDFS also work as a ready-to-restore backup solution in case of an error.

Speakers

Andrej Babolcai

Software Engineer, ESET
Software Engineer at ESET. Currently working with Big Data technologies at ESET. Responsible for collecting, storing and making data available to end users. Previously worked at Honeywell. Speaking experience: Caro workshop 2016 (http://2016.caro.org/)


Tuesday November 15, 2016 12:00 - 12:50 CET
Giralda I/II

12:00 CET

Building Streaming Applications with Apache Apex - Thomas Weise & Chinmay Kolhatkar, DataTorrent
Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file based sources or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases etc.). Apex has the Malhar library with a wide range of connectors and other operators that are readily available to build applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead-log, incremental state saving, windowing etc.) and APIs for application specification.

Speakers

Chinmay Kolhatkar

Chinmay is a Software Engineer at DataTorrent Software, India and a committer on the Apache Apex project.

Thomas Weise

CTO, Atrato.io
Thomas is Apache Apex PMC Chair and CTO at Atrato. Prior to founding Atrato he was an Architect at DataTorrent and led the development of Apex from the beginning of the project. Before that he was a member of the Hadoop Team at Yahoo! and contributed to several of the big data ecosystem... Read More →



Tuesday November 15, 2016 12:00 - 12:50 CET
Carmona

13:00 CET

Large Scale Open Source Data Processing Pipelines at Trivago - Clemens Valiente, Trivago
trivago is processing roughly 7 billion events per day with an architecture that is entirely open source - from producing the data until its visualization in dashboards and reports. This talk will explain the idea behind the pipeline, highlight a particular business use case and share the experience and engineering challenges from two years in production. Clemens Valiente will furthermore show the different tools, frameworks and systems used, with Kafka for data ingestion, Hadoop and Hive for processing and Impala for querying as the main focus. The successful implementation of this large scale data processing pipeline fundamentally transformed the way trivago was able to approach its business.

Speakers

Clemens Valiente

Lead Data Engineer, trivago GmbH
I'm part of trivago's Data Engineering team, where we are running a data processing pipeline through Kafka, Hadoop, Impala and R, processing roughly 7 billion events per day. Our Hadoop cluster is central for BI dashboards, reports, ad hoc analyses, personalisation, bidding and recommendation... Read More →


Tuesday November 15, 2016 13:00 - 13:50 CET
Giralda VI/VII

13:00 CET

Massively Parallel Data Warehousing in the Hadoop Stack - Gregory Chase & Roman Shaposhnik, Pivotal
Hadoop has been touted as a replacement for data warehouses. In practice Hadoop has had success offloading ETL/ELT workloads, but still has gaps serving requirements for operational analytics.

Apache Bigtop now includes Greenplum Database in deployment of big data solutions. Greenplum Database is an open source massively parallel data warehouse based on PostgreSQL, and is an excellent addition to the Hadoop ecosystem.

In this session we'll cover:
  • Introduction to Greenplum 
  • Bigtop Support for Greenplum
  • External tables in Hadoop by Greenplum
  • Parallel reads and writes to Hadoop by Greenplum
  • Running advanced analytics on structured and unstructured data in both Hadoop and Greenplum via Apache MADlib (incubating)
  • Geospatial and Machine Learning in Greenplum based on HDFS data
  • Storing data from a data lake in Greenplum for high throughput analytical queries

Speakers

Gregory Chase

Director Product Marketing, PagerDuty
Greg Chase is Director of Product Marketing for PagerDuty Automation and Rundeck. He's been in marketing and engineering in software companies for too many decades, evangelizing and building automation platforms, developer tools and data engineering frameworks. Before PagerDuty, Greg... Read More →

Roman Shaposhnik

Director of Open Source, Linux Foundation
Apache Software Foundation and Data, oh but also unikernels


Tuesday November 15, 2016 13:00 - 13:50 CET
Nervion/Arenal I

13:00 CET

What 50 Years of Data Science Leaves Out - Sean Owen, Cloudera

We're told "data science" is the key to unlocking the value in big data, but nobody seems to agree just what it is -- engineering, statistics, both? David Donoho's paper "50 Years of Data Science" offers one of the best criticisms of the hype around data science from a statistics perspective, and proposes that data science is not new, if it's anything at all. This talk will examine these points, and respond with an engineer's counterpoints, in search of a better understanding of data science.


Speakers

Sean Owen

Director of Data Science, Cloudera
Sean is Director of Data Science, based in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time learning on Hadoop. He is on the Apache Spark PMC and co-author of Advanced Analytics with Spark. Previously, Sean was a senior... Read More →


Tuesday November 15, 2016 13:00 - 13:50 CET
Estepa

13:00 CET

Scalable Private Information Retrieval: Introducing Apache Pirk (incubating) - Ellison Anne Williams, Creator of Apache Pirk
Querying information over TBs of data where no one can see what you query or the responses obtained? It sounds like science fiction, but it is actually the science of Private Information Retrieval (PIR). This talk will introduce Apache Pirk - a new incubating Apache project designed to provide a framework for scalable, distributed PIR. We will discuss the motivation for Apache Pirk, its distributed implementations in platforms such as Spark and Storm, its current algorithms, the power of homomorphic encryption, and take a look at the path forward.

Speakers

Ellison Anne Williams

Ellison Anne Williams is a creator and PMC member of Apache Pirk, a pure mathematician by training, and a practical computer scientist in real life. Her passion is doing cool stuff with massive amounts of data.


Tuesday November 15, 2016 13:00 - 13:50 CET
Carmona

13:00 CET

SASI, Cassandra on the Full Text Search Ride! - DuyHai Doan, Datastax
Apache Cassandra is a scalable database with high availability features. But they come with severe limitations in terms of querying capabilities.



Since the introduction of SASI in Cassandra 3.4, the limitations belong to the past. Now you can create indices on your columns as well as benefit from full text search capabilities with the introduction of the new `LIKE '%term%'` syntax.



To illustrate how SASI works, we'll use a database of 100 000 albums and artists. We'll also show how SASI can help to accelerate analytics scenarios with Apache Spark using SparkSQL predicate push-down.



We also highlight some use cases where SASI is not a good fit and should be avoided (there is no magic, sorry).


Speakers

DuyHai Doan

Technical Advocate, Datastax
DuyHai DOAN is an Apache Cassandra Evangelist at DataStax and committer for Apache Zeppelin. He spends his time between technical presentations/meetups on Cassandra, coding on open source projects like Achilles or Apache Zeppelin to support the community and helping all companies... Read More →


Tuesday November 15, 2016 13:00 - 13:50 CET
Nervion/Arenal II/III

13:00 CET

Elastic Spark Programming Framework - A Dependency Injection Based Programming Framework for Spark Applications - Bruce Kuo
Apache Spark is the hottest computing engine nowadays. More and more people in the big data industry are starting to use Spark in production systems, such as machine learning and data ETL applications. However, developers take quality of code seriously in production systems. In our past experience, we found that there is a gap between development and production. The difficulties we met when developing Spark applications in production systems are:

(1) hard to communicate with components,

(2) indirect management of application arguments,

(3) inadequate code maintainability.

To solve these problems and make development smoother, we propose a dependency-injection-based programming framework on JVM systems. It provides basic management, monitoring, and better communication mechanisms. Its great flexibility can help developers write Spark applications and integrate with components in a better manner.

Speakers

Bruce Kuo

Software Engineer, Yahoo!
Chun-Ting Kuo (Bruce) works at Yahoo as a data engineer, and he dedicates his work to developing data products and scientific applications. His experience covers Spark, Hadoop, algorithms, and a little machine learning. When he is free, he loves to code and know novel techniques... Read More →


Tuesday November 15, 2016 13:00 - 13:50 CET
Giralda I/II

13:00 CET

Power Pig with Spark - Liyun Zhang, Intel
Apache Pig is a popular scripting platform for processing and analyzing large data sets in the Hadoop ecosystem. With its open architecture and backend neutrality, Pig scripts can currently run on MapReduce and Tez. Apache Spark is an open-source data analytics cluster computing framework that has gained significant momentum recently. Besides offering performance advantages, Spark is also a more natural fit for the query plan produced by Pig. Pig on Spark enables improved ETL performance while also supporting users intending to standardize on Spark as the execution engine.

Speakers

Liyun Zhang

Software Engineer, Intel
Liyun Zhang is a Software Engineer at Intel. She is one of the main contributors to the Pig on Spark project. Prior to that, she made several contributions to Intel Distribution for Hadoop.


Tuesday November 15, 2016 13:00 - 13:50 CET
Giralda V

13:00 CET

Sparkler - Crawler on Apache Spark - Karanjeet Singh & Thamme Gowda, USC
A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines, knowledge bases, etc. In this presentation, Karanjeet Singh and Thamme Gowda will describe a new crawler called Sparkler (contraction of Spark-Crawler) that makes use of recent advancements in distributed computing and information retrieval domains by conglomerating various Apache projects like Spark, Kafka, Lucene/Solr, Tika, and Felix. Sparkler is an extensible, highly scalable, and high-performance web crawler that is an evolution of Apache Nutch and runs on an Apache Spark cluster. GitHub Link - https://github.com/USCDataScience/sparkler

Speakers

Thamme Gowda

Graduate Student, University of Southern California
Thamme Gowda is a grad student at the Univ. of Southern California, Los Angeles, CA, and also an intern at NASA Jet Propulsion Laboratory, Pasadena, CA, USA. He is a co-founder of Datoin.com, a software as a service platform built using Hadoop and Spark. He is also a committer and... Read More →

Karanjeet Singh

Research Assistant, University of Southern California
He is pursuing his Master's degree in Computer Science from the University of Southern California (USC). His projects and research are mostly from the area of Information Retrieval and Data Science. He is also affiliated with NASA Jet Propulsion Lab. Prior to this, he was working... Read More →



Tuesday November 15, 2016 13:00 - 13:50 CET
Giralda III/IV

13:50 CET

14:00 CET

Women in Big Data Luncheon & Program
Limited Capacity seats available

On behalf of Women in Big Data, we'd like to invite you to a luncheon/meetup event taking place Tuesday, November 15.

This luncheon is open to women and allies who are interested in attending to network and collaborate with other like-minded individuals, with the ultimate goal of strengthening and increasing diversity in the big data community. The luncheon is free to attend, but space is limited. Please RSVP here if you'd like to attend.


Luncheon Agenda



  • 1:50pm - WiBD Overview – Anna Marchon

  • 2:00pm - Keynote: Tina Rosario, Global VP, Enterprise Data Management at SAP 

  • 2:30pm - Keynote: Marina Alekseeva, GM of the Intel Software and Service Group in Russia

  • 3:00pm - Networking


 

 A big thank you to our lunch sponsor, Women in Big Data. For more details on WiBD and how to get involved, visit https://www.womeninbigdata.org/

Speakers

Marina Alekseeva

Director of Software Product Services, Intel SSG
Marina Alekseeva is the General Manager of the Intel Software and Service Group (SSG) in Russia and the Director of Software Product Services, a multinational multifunctional team which provides complete solutions and infrastructure for software production and product delivery. Member... Read More →

Tina Rosario

Tina Rosario - Global Vice President, Enterprise Data Management, SAP
Tina Rosario is a business strategy professional with over 25 years of experience in IT, business process re-engineering, change management and enterprise data management. During her 12 years at SAP, Tina has held executive positions in business operations, consulting services and... Read More →


Tuesday November 15, 2016 14:00 - 15:10 CET
Santa Cruz

15:30 CET

Low Latency Web Crawling on Apache Storm - Julien Nioche, DigitalPebble Ltd.
StormCrawler is an open source collection of resources, mostly implemented in Java, for building low-latency, scalable web crawlers on Apache Storm. After a short introduction to Apache Storm and an overview of what StormCrawler provides, we will compare it with similar projects like Apache Nutch and present several real-life use cases. In particular we will see how StormCrawler can be used with ElasticSearch and Kibana for crawling and indexing web pages and also for monitoring the crawl itself.

Speakers

Julien Nioche

Director, DigitalPebble Ltd
I run DigitalPebble Ltd, a consultancy based in Bristol, UK and specialising in open source solutions for text engineering. My expertise covers web crawling, natural language processing, machine learning and search. I am a committer on Apache Nutch and am also involved in several... Read More →


Tuesday November 15, 2016 15:30 - 16:20 CET
Giralda III/IV

15:30 CET

Your Datascience Journey with Apache Zeppelin - Moon soo Lee, Anthony Corbacho & Jongyoul Lee, NFLabs
Take a journey together to see how Apache Zeppelin started, how it helps your data science lifecycle, and how it became a popular TLP project. We'll also see how the community's focus has changed, from basic notebook features and Spark integration to advanced features like multi-tenancy. Moon soo Lee will explain the value of Apache Zeppelin with demos of some key use case scenarios. We'll also look at the ecosystem around it: how various projects and companies are using Apache Zeppelin in their products and services in many different ways.

Finally, we'll discuss Apache Zeppelin's future roadmap and some challenges that the community faces.

Speakers
avatar for Jongyoul Lee

Jongyoul Lee

Software Development Engineer, ZEPL
I'm a PMC member of Apache Zeppelin and work at ZEPL. In Apache Zeppelin, I focus on stabilizing it for production use, developing enterprise features, and enhancing the Apache Spark/JDBC features. Personally, I'm really interested in distributed and fault-tolerant... Read More →
avatar for Moon

Moon

CTO, NFLabs
Moon soo Lee is the creator of Apache Zeppelin and Co-Founder and CTO at NFLabs. For the past few years he has been working on bootstrapping the Zeppelin project and its community. His recent focus is growing the Zeppelin community and driving adoption.


Tuesday November 15, 2016 15:30 - 16:20 CET
Carmona

15:30 CET

AMIDST Toolbox: A Java Toolbox for Scalable Probabilistic Machine Learning - Andres Masegosa, NTNU
We would like to present our open source AMIDST toolbox for analysis of large-scale data sets using probabilistic machine learning models. AMIDST runs algorithms in a distributed fashion for learning a wide range of latent variable models such as Gaussian mixtures, (probabilistic) principal component analysis, Hidden Markov Models, Kalman Filter, Latent Dirichlet Allocation, etc. This toolbox is able to learn any user-defined probabilistic (graphical) model with billions of nodes using novel message passing algorithms.



We plan to give an overview of the AMIDST toolbox, some details about the API and the integration with Flink, Spark (and other open source tools), and an analysis of the scalability of our learning algorithms. All this is set in the context of a real use case scenario in the financial domain (BCC group), where millions of customer profiles are analyzed.

Speakers
avatar for Andres Masegosa

Andres Masegosa

PhD, NTNU
I am a research fellow at NTNU (Norway) with broad interests in data mining and machine learning using probabilistic graphical models. Lately, my research has focused on scalable machine learning methods for solving real use cases in the financial (BCC group) and automotive industry... Read More →


Tuesday November 15, 2016 15:30 - 16:20 CET
Estepa

15:30 CET

Classifying Unstructured Text - Deterministic and Machine Learning Approaches - Christian Winkler & Stephanie Fischer, mgm Technology Partners GmbH
Text is one of the most used forms of communication and is ubiquitous on the Internet. Social networks like Facebook and Twitter mainly contain unstructured text; the same is true for content-driven websites.



For humans it is easy to grasp the meaning of text; for computers it is much more difficult. Used correctly, computers can help humans tremendously in structuring and classifying huge amounts of text. This "symbiosis" can help humans work more efficiently, reduce repetitive work and use the uncovered structure.



Our talk starts with visualizations giving us ideas how to automatically classify texts. Then we will demonstrate that manual intervention is sometimes necessary and how this can be used as a basis for machine learning. This helps significantly in classifying more complicated cases.



As software tools we use R, Apache Solr, D3.js, and several NLP and ML tools from the ASF.

Speakers
avatar for Stephanie Fischer

Stephanie Fischer

Big Data, Agile and Change Management, mgm consulting partners
I concentrate on user-centricity of Big Data technologies. My focus is finding the questions really worth solving. I think Big Data has the potential to advance humanity into a desirable direction. I have a background in organizational development, agility and business analytics... Read More →
avatar for Christian Winkler

Christian Winkler

Enterprise architect, mgm technology partners GmbH
Christian has worked for 20 years with Internet technologies. Recently, he has focused on working with large amounts of data or many users. As big data applications become more and more popular, lots of applications evolve. Many aggregates have to be calculated to describe characteristics... Read More →


Tuesday November 15, 2016 15:30 - 16:20 CET
Giralda V

15:30 CET

User Defined Functions and Materialized Views in Cassandra 3.0 - DuyHai Doan, Datastax
Cassandra is evolving at a very fast pace and keeps introducing new features that close the gap with the traditional SQL world, but they are always designed with a distributed approach in mind.



First we'll take a look at the recent user-defined functions and show how they can improve your application performance and enrich your analytics use cases.



Next comes a tour of materialized views, a major improvement that drastically changes the way people model data in Cassandra and makes developers' lives easier!

Speakers
avatar for DuyHai Doan

DuyHai Doan

Technical Advocate, Datastax
DuyHai DOAN is an Apache Cassandra Evangelist at DataStax and committer for Apache Zeppelin. He spends his time between technical presentations/meetups on Cassandra, coding on open source projects like Achilles or Apache Zeppelin to support the community and helping all companies... Read More →


Tuesday November 15, 2016 15:30 - 16:20 CET
Nervion/Arenal II/III

15:30 CET

Building and Running a Solr-as-a-Service for IBM Watson - Shai Erera, IBM
Running a managed Solr service brings fun challenges with it, to both the users and the service itself. Users typically do not have access to all components of the Solr system (e.g. the ZK ensemble, the actual nodes that Solr runs on etc.). On the other hand, the service must ensure high availability at all times, and handle what are often user-driven tasks such as version upgrades, taking nodes offline for maintenance and more.



In this talk I will describe how we tackle these challenges to build a managed Solr service on the cloud, which currently hosts a few thousand Solr clusters. I will focus on the infrastructure that we chose to run the Solr clusters on, as well as how we ensure high availability, cluster balancing and version upgrades.

Speakers
avatar for Shai Erera

Shai Erera

STSM, Social Analytics & Technologies, IBM
Shai Erera is a Researcher at IBM Research, Haifa, Israel. Shai earned his M.Sc in Computer Science from the University of Haifa in 2007. Shai’s work experience includes the development of search-based systems over Lucene and Solr and he is also a Lucene/Solr committer.


Tuesday November 15, 2016 15:30 - 16:20 CET
Giralda VI/VII

15:30 CET

Getting Started Contributing to Apache Spark - Holden Karau, IBM
Apache Spark is one of the most popular tools for big data and, with over 400 open pull requests as of this writing, is very active in terms of development as well. With such a large volume of contributions, it can feel difficult to get started contributing to Apache Spark. This talk is developer focused and will walk through how to find good issues to start with, formatting code, finding reviewers, and what to expect in the code review process. We will also talk about alternatives to contributing to Apache Spark directly (such as creating packages).

Speakers
avatar for Holden Karau

Holden Karau

Developer Advocate, Google
Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning... Read More →


Tuesday November 15, 2016 15:30 - 16:20 CET
Giralda I/II

15:30 CET

Implementing BigPetStore in Spark and Flink - Márton Balassi, Cloudera
Implementing use cases on unified data platforms. Having a unified data processing engine empowers Big Data application developers as it makes connections between seemingly unrelated use cases natural. This talk discusses the implementation of the so-called BigPetStore project (which is a part of Apache Bigtop) in Apache Spark and Apache Flink. The aim of BigPetStore is to provide a common suite to test and benchmark Big Data installations. The talk features best practices and implementation with the batch, streaming, SQL, DataFrames and machine learning APIs of Apache Spark and Apache Flink side by side. A range of use cases is outlined in both systems, from data generation through ETL and recommender systems to online prediction.

Speakers
avatar for Márton Balassi

Márton Balassi

Solutions Architect, Cloudera
Márton Balassi is a Solution Architect at Cloudera and a PMC member at Apache Flink. He focuses on Big Data application development, especially in the streaming space. Marton is a regular contributor to open source and has been a speaker of a number of Big Data related conferences... Read More →


Tuesday November 15, 2016 15:30 - 16:20 CET
Nervion/Arenal I

16:30 CET

Unified Benchmarking of Big Data Platforms - Axel-Cyrille Ngonga Ngomo, INFAI
Which Big Data platform should I use for my problem? This remains one of the most important questions for practitioners. In this talk, we will present HOBBIT (http://project-hobbit.eu), a universal benchmarking platform for Big Data. The platform provides a unified approach for benchmarking Big Data frameworks. Mimicking algorithms generated from real data ensure that the datasets used for benchmarking resemble real data but are open for all to use, thereby circumventing the issues that arise when using company-bound data. The core of the platform implements industry-relevant KPIs gathered from more than 70 Big-Data-driven organizations. The results are generated in machine-readable formats to ensure that they can be analyzed and used for improving tools and frameworks. In the talk, I will present the architecture of the framework and some preliminary results.

Speakers
avatar for Axel-Cyrille Ngonga Ngomo

Axel-Cyrille Ngonga Ngomo

Head of Research Group, INFAI
Head of AKSW (http://aksw.org) at University of Leipzig/InfAI, a research group with ca. 50 members. Author of 120+ research papers and 20+ presentations at top-tier conferences. Received manifold research awards including Next Einstein Forum award 2016, 12 best research paper awards... Read More →


Tuesday November 15, 2016 16:30 - 17:20 CET
Giralda VI/VII

16:30 CET

Apache Sentry - High Availability - Sravya Tirukkovalur & Hao Hao, Cloudera
As big data continues to get bigger, deploying flexible and robust security is more important than ever. In this talk, we'll discuss Apache Sentry, a central service for policy management, and its various pluggable authorization engines, which integrate with many Hadoop components. We will also dive deep into how its latest design allows for fault tolerance, high availability and scalability.



Unlike in traditional database systems, authorization in the Hadoop ecosystem is a tricky problem because there are multiple doors to the same data. Sentry provides a great deal of usability by letting users define policies once and replicating the state as necessary. With this come additional challenges of designing a distributed service that manages consistent state. This talk will touch upon core design choices that serve as building blocks for any robust distributed system.

Speakers
avatar for Hao Hao

Hao Hao

Software Engineer, Cloudera Inc
Hao Hao is a software engineer at Cloudera. She is an active committer and a PMC member of the Apache Sentry project. Hao performed extensive research on smartphone security and web security while she was a PhD student at Syracuse University. Prior to joining Cloudera, Hao worked at... Read More →
avatar for Sravya Tirukkovalur

Sravya Tirukkovalur

Software Engineer, Cloudera
Sravya Tirukkovalur is a software engineer at Cloudera working on Hadoop security. She is one of the active contributors to the Apache Sentry project and also the PMC Chair. She got her Masters degree from The Ohio State University, with her research focus on High performance and... Read More →


Tuesday November 15, 2016 16:30 - 17:20 CET
Nervion/Arenal I

16:30 CET

Meerkat: Anomaly Detection as a Service - Julien Herzen, Swisscom
Julien will present Meerkat, a system built at Swisscom to do real-time anomaly detection on time series. Meerkat uses a combination of machine learning and big data technologies in order to trigger alerts in case of problems in the Swisscom network.

Meerkat monitors arbitrary time series and trains statistical models that can be used to spot anomalies from both batch (historical) and streaming (live) data. It is composed of Python modules for anomaly detection and data ingestion from Druid, as well as Scala modules using Apache Spark for ingesting from Apache Kafka and Apache Hadoop's HDFS.

Meerkat is currently successfully used at Swisscom to trigger alerts in case of problems with VoIP calls, which represent more than 3 million phone calls per day.

This is joint work with Khue Vu, who worked on Meerkat for his MSc thesis at EPFL, and the network intelligence team of Swisscom Innovation.
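
The abstract above describes training statistical models on historical data and using them to spot anomalies in live data. The sketch below illustrates that general idea only; it is not Meerkat's method. It assumes the simplest possible model (mean and standard deviation with a z-score threshold), and the function names, the threshold and the sample values are all hypothetical.

```python
from statistics import mean, stdev


def fit_baseline(history):
    """Fit a very simple statistical model: mean and sample standard deviation."""
    return mean(history), stdev(history)


def is_anomaly(value, baseline, threshold=3.0):
    """Flag a value whose z-score against the baseline exceeds the threshold."""
    mu, sigma = baseline
    if sigma == 0:
        # A constant history: anything different from it is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical batch (historical) data, e.g. calls per minute.
history = [100, 102, 98, 101, 99, 103, 97, 100]
baseline = fit_baseline(history)

# Hypothetical streaming (live) data, scored one point at a time.
live = [101, 99, 150, 100]
flags = [is_anomaly(v, baseline) for v in live]
print(flags)
```

With these made-up numbers only the third live value is flagged. A production system would need to handle seasonality and drift, which a fixed mean and standard deviation do not capture.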

Speakers
avatar for Julien Herzen

Julien Herzen

data scientist, Swisscom
Julien is a data scientist at Swisscom. His experience lies in the areas of machine learning and network algorithms, and his current work includes building analytics and monitoring platforms using big data technologies such as Apache Spark, Druid and Apache Cassandra. He has a PhD... Read More →


Tuesday November 15, 2016 16:30 - 17:20 CET
Giralda V

16:30 CET

The Myth of the Big Data Silver Bullet - Why Requirements Still Matter - Nick Burch, Quanticate
We've all heard the hype - Big Data will solve all your storage, processing and analytic problems effortlessly! As Big Data moves along the adoption cycle, there's a wider range of possible technologies and platforms you could use, but sadly picking the right one still remains crucial to success. Some moving beyond the buzzwords to deploy Big Data find things really do work well, but others rapidly run into issues. The difference usually isn't the technologies or the vendors per se, but their appropriateness to the requirements, which aren't always clear up-front...

This session won't tell you what Big Data solution you need. Instead, we'll cover some of the pitfalls, and help you with the questions towards working out your requirements in time for your Big Data system to be a success!

Speakers
NB

Nick Burch

CTO, Quanticate
Nick began contributing to Apache projects in 2003, and hasn't looked back since! He's mostly involved in "Content" projects like Apache POI, Apache Tika and Apache Chemistry, as well as foundation-wide activities like Conferences and Travel Assistance. Nick is CTO at Quanticate, a... Read More →


Tuesday November 15, 2016 16:30 - 17:20 CET
Nervion/Arenal II/III

16:30 CET

Avro: Travel Across (r)evolution - Arek Osinski & Darek Eliasz, Allegro Group
These days, we generate enormous amounts of data. The biggest challenge lies in transforming raw data into knowledge. We would like to take you on a short journey and show our approach to moving from the unstructured world of microservices to a world with Avro schemas inside our data pipelines.

Avro is a well-known format for storing and online processing of information of any kind. What are the key features of this format? What are the common problems? Where can you run into pitfalls? How does this influence our Big Data ecosystem?

The whole story will be illustrated with examples from a real-life implementation.

Speakers
avatar for Dariusz Eliasz

Dariusz Eliasz

Senior Data Platform Engineer, Grupa Allegro Sp. z o.o.
Mainly interested in big data platform architecture and data governance. Enthusiast of scalable distributed solutions, processing large amounts of data and continuous improvement.
AO

Arek Osinski

Senior Data Platform Engineer, Allegro
Works at Allegro Group as a senior data engineer. From the beginning he has been involved in building and maintaining the Hadoop infrastructure within Allegro Group. Previously he was responsible for maintaining large-scale database systems. Passionate about new technologies and cyclin... Read More →


Tuesday November 15, 2016 16:30 - 17:20 CET
Carmona

16:30 CET

Multi-Tenant Machine Learning with Apache Aurora and Apache Mesos - Stephan Erb, Blue Yonder GmbH
Data scientists care about statistics and fast iteration cycles for their experiments. They should not be concerned with technicalities like hardware failures, tenant isolation, or low cluster utilization. In order to shield its data scientists from these matters, Blue Yonder is using Apache Aurora.



When adopting Aurora, our goal was to run multiple machine learning projects on the same physical cluster. This talk will go into details of this adoption process and highlight key engineering decisions we have made. Particular focus will reside on the multi-tenancy and oversubscription features of Apache Aurora and Apache Mesos, its underlying resource manager.



Audience members will learn about the fundamentals of both Apache projects and how those can be assembled into a capable machine learning platform.

Speakers
avatar for Stephan Erb

Stephan Erb

Software Engineer, Blue Yonder GmbH
Stephan Erb is a software engineer driven by the goal to make Blue Yonder's data scientists more productive. Stephan holds a master's degree in computer science from the Karlsruhe Institute of Technology (KIT). He is a PMC member of the Apache Aurora project and tweets at @ErbSte... Read More →


Tuesday November 15, 2016 16:30 - 17:20 CET
Santa Cruz

16:30 CET

Ranking the Web with Spark - Sylvain Zimmer, Common Search
Common Search is building an open source search engine based on Common Crawl's monthly dumps of several billion webpages. Ranking every URL on the Web in a transparent and reproducible way is core to the project.



In this presentation, Sylvain Zimmer will explain why Spark is a great match for the job, how the current ranking pipeline works and what challenges it faces to grow in scale and complexity, in order to improve the quality of search results.



Specifically, we will dive into the new Spark 2.0 features that made it practical to compute PageRank from Python on every URL found in Common Crawl, and show how anyone can reproduce and tweak the results on their cloud servers.
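
For readers unfamiliar with PageRank, the computation mentioned above can be sketched in a few lines of plain Python. This is a single-machine illustration of the classic power-iteration formulation, not Common Search's Spark pipeline. The damping factor of 0.85, the iteration count and the example link graph are assumptions chosen for the illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outgoing links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * ranks[page] / n
                for target in pages:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical three-page web graph.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}
ranks = pagerank(links)
for page, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(rank, 4))
```

At web scale, the per-page loop becomes a distributed join and aggregation over the link graph, which is where an engine such as Spark comes in.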

Speakers
SZ

Sylvain Zimmer

Founder, Common Search
Sylvain Zimmer is a software developer and longtime free culture advocate. In 2004 he founded Jamendo, the largest Creative Commons music community online. Since 2012, he has been the CTO of Pricing Assistant, a startup specialized in large-scale crawling of E-commerce websites. He... Read More →


Tuesday November 15, 2016 16:30 - 17:20 CET
Giralda III/IV

16:30 CET

Writing Apache Spark and Apache Flink Applications Using Apache Bahir - Luciano Resende, IBM
Big Data is all about being able to access and process data in various formats and from various sources. Apache Bahir provides extensions to distributed analytic platforms, giving them access to different data sources. In this talk we will introduce you to Apache Bahir and its various connectors that are available for Apache Spark and Apache Flink. We will also go over the details of how to build, test and deploy a Spark application using the MQTT data source for the new Apache Spark 2.0 Structured Streaming functionality.

Speakers
avatar for Luciano Resende

Luciano Resende

Architect, Spark Technology Center, IBM
Luciano Resende is an Architect in IBM Analytics. He has been contributing to open source at the ASF for over 10 years. He is a member of the ASF and currently contributes to various big data related Apache projects including Spark, Zeppelin, Bahir. Luciano is the project chair for... Read More →


Tuesday November 15, 2016 16:30 - 17:20 CET
Giralda I/II

17:20 CET

 
Wednesday, November 16
 

07:00 CET

Morning Run
Come meet in the lobby of the Melia Sevilla at 7:00am for a morning run. The plan is to cross to the park and run next to the river. 

This will last an hour and the group will be back by 8:00am.

Wednesday November 16, 2016 07:00 - 08:00 CET
Melia Sevilla Hotel Lobby

08:30 CET

Breakfast
Wednesday November 16, 2016 08:30 - 09:30 CET
Giralda Foyer

08:30 CET

Sponsor Showcase
Wednesday November 16, 2016 08:30 - 12:00 CET
Triana Foyer

08:30 CET

Registration
Wednesday November 16, 2016 08:30 - 13:00 CET
Triana Foyer

09:30 CET

Keynote: Introduction to Tensorflow: Tips and Tricks for Neural Net Design - Gema Parreño, AI Developer
TensorFlow has been part of the core of the Google search engine and has been available as an open source tool since last November. The keynote will introduce the architecture of the library with a focus on machine vision, and will dive into the data modeling of a project that was a global finalist in the 2016 NASA Space Apps Challenge.

Speakers
GP

Gema Parreño

AI Developer
Gema Parreño is a multiple award-winning product designer who has focused on Artificial Intelligence and software architecture for 2 years, with notable experience in Natural Language Understanding. She is now developing recursive neural networks and clustering classifications... Read More →


Wednesday November 16, 2016 09:30 - 09:50 CET
Giralda I/II

09:55 CET

Keynote: Lessons from the Trenches: How Apache Hadoop is Being Used & The Challenges Its Users Face - John Mertic, Director, ODPi and Open Mainframe Project, Linux Foundation
Apache Hadoop has earned the support of a large & diverse community, with significant interest from businesses, governments, academia & technology vendors – each varying in their goals & objectives for benefiting from the technology. While the distributed data platform’s ecosystem continues to grow, there remains some debate about its ease of adoption & how a wide-range of users can gain business value from it. This session from John Mertic, Director of Program Management for ODPi, will cover how solution providers, app vendors & end users are deploying Apache Hadoop, the daily challenges they face in their environments, how they’d like to use the technology moving forward & much more. Citing insights from ODPi members Capgemini, Linaro & GE, Mertic will break down what he’s learned to demystify the most common Apache Hadoop complexities & barriers to further enterprise adoption.

Speakers
avatar for John Mertic

John Mertic

Executive Director, Open Mainframe Project
John Mertic is the Director of Program Management for The Linux Foundation. Under his leadership, he has helped ASWF, ODPi, Open Mainframe Project, and R Consortium accelerate open source innovation and transform industries. John has an open source career spanning two decades, both... Read More →


Wednesday November 16, 2016 09:55 - 10:10 CET
Giralda I/II

10:00 CET

BarCampApache
Join us for an ‘unconference’ with no set schedule, facilitated by those involved in various Apache projects. More details and registration information can be found here:
https://wiki.apache.org/apachecon/BarCampApacheSeville

Wednesday November 16, 2016 10:00 - 16:20 CET
Estepa

10:15 CET

Coffee Break
Wednesday November 16, 2016 10:15 - 11:00 CET
Giralda Foyer

11:00 CET

On-Premise, UI-Driven Hadoop/Spark/Flink/Kafka/Zeppelin-as-a-service - Jim Dowling, KTH Royal Institute of Technology
Since April 2016, SICS Swedish ICT has provided Hadoop/Spark/Flink/Kafka/Zeppelin-as-a-service to researchers in Sweden. We have developed a UI-driven multi-tenant platform (Apache v2 licensed) in which researchers securely develop and run their applications. Applications can be either deployed as jobs (batch or streaming) or written and run directly from Notebooks in Apache Zeppelin. All applications are run on YARN within a security framework built on project-based multi-tenancy. A project is simply a grouping of users and datasets. Datasets are first-class entities that can be securely shared between projects. Our platform also introduces a necessary condition for elasticity: pricing. Application execution time in YARN is metered and charged to projects, which also have HDFS quotas for disk usage. We also support project-specific Kafka topics that can be securely shared.

Speakers
avatar for Jim Dowling

Jim Dowling

CEO, Logical Clocks
Jim Dowling is an Associate Professor at KTH Royal Institute of Technology in Stockholm as well as a Senior Researcher at SICS Swedish ICT. He received his Ph.D. in Distributed Systems from Trinity College Dublin (2005) and worked at MySQL AB (2005-2007). He is lead architect of Hops... Read More →


Wednesday November 16, 2016 11:00 - 11:50 CET
Giralda III/IV

11:00 CET

Why is My Hadoop Cluster Slow? - Steve Loughran, Hortonworks
Apache Hadoop is used to run jobs that execute tasks over multiple machines with complex dependencies between tasks. And at scale, there can be 10's to 1000's of tasks running over 100's to 1000's of machines which increases the challenge of making sense of their performance. Pipelines of such jobs that logically run a business workflow add another level of complexity. No wonder that the question of why Hadoop jobs run slower than expected remains a perennial source of grief for developers. In this talk, we will draw on our experience in debugging and analyzing Hadoop jobs to describe some methodical approaches to this and present current and new tracing and tooling ideas that can help semi-automate parts of this difficult problem.

Speakers
avatar for Steve Loughran

Steve Loughran

Member of Technical Staff, Hortonworks
Steve Loughran is a developer at Hortonworks, where he works on leading-edge Hadoop applications, most recently on Apache Slider and on Apache Spark's integration with Hadoop and YARN, and Hadoop's S3A connector to Amazon S3. He's the author of Ant in Action, a member of the Apache... Read More →


Wednesday November 16, 2016 11:00 - 11:50 CET
Carmona

11:00 CET

Introduction to Apache Beam - Jean-Baptiste Onofré, Apache Software Foundation & Dan Halperin, Google
Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines. The same Beam pipelines work in batch and streaming, and on a variety of open source and private cloud big data processing backends including Apache Flink, Apache Spark, and Google Cloud Dataflow. This talk will introduce Apache Beam's programming model and mechanisms for efficient execution. The speakers will show how to build Beam pipelines, and demo how to use it to execute the same code across different runners.


Speakers
DH

Daniel Halperin

Google
Dan Halperin is a PMC member of Apache Beam. He has worked on Beam and Google Cloud Dataflow for 2 years. Previously, he was the director of research for scalable data analytics at the University of Washington eScience Institute, where he worked on scientific big data problems in... Read More →
JO

Jean-Baptiste Onofré

Talend
JB is PMC member for Apache Beam. He is a long-tenured Apache member, serving on PMC/committer for about 15 projects that range from integration to big data.


Wednesday November 16, 2016 11:00 - 11:50 CET
Nervion/Arenal II/III

11:00 CET

Big Data Machine Learning with Apache PredictionIO - Simon Chan, Salesforce
Apache PredictionIO (incubating) provides a full stack machine learning environment on top of Apache Spark, making it easy for developers to iterate on production-deployable machine learning engines. Apache PredictionIO is designed for data scientists and developers to build predictive web services for real-world applications in a fraction of the time normally required.

In this talk, the speaker will introduce the latest developments of PredictionIO, and show how to use it to build and deploy predictive engines in real production environments. Using PredictionIO's DASE design pattern, Simon will illustrate how developers can build machine learning applications with the separation of concerns (SoC) in mind. The speaker will also go over the future roadmap of Apache PredictionIO and some of its recent development.


Speakers
avatar for Simon Chan

Simon Chan

Senior Director, Einstein, Salesforce
@simonchannet Simon Chan is a Senior Director of Product Management for Salesforce Einstein where he oversees platform development and delivers products that empower anyone to build smarter apps with Salesforce. Simon is a product innovator and serial entrepreneur with more than... Read More →


Wednesday November 16, 2016 11:00 - 11:50 CET
Giralda V

11:00 CET

Machine Learning on Apache Apex with Apache Samoa - Bhupesh Chawda, DataTorrent Software
This talk will be about the integration of Apache Samoa, a distributed streaming machine learning framework (https://samoa.incubator.apache.org), with Apache Apex, a distributed, scalable and fault-tolerant stream processing engine (https://apex.apache.org). Apache Samoa is a kind of WORA (write-once-run-anywhere) framework where algorithms developed on Samoa can be run on other distributed stream processing engines like Storm, Samza and Flink. This talk will introduce the integration story with Apache Apex and outline the process and the challenges therein. In addition, the talk will present some comparative analysis of the performance of Samoa algorithms on a few popular integrated runners, namely Apache Storm, Apache Flink and Apache Apex.

Speakers
avatar for Bhupesh Chawda

Bhupesh Chawda

Software Engineer, DataTorrent Software India Pvt. Ltd.
Bhupesh Chawda is a Software Engineer at DataTorrent Software India Pvt. Ltd. He is also a committer on the Apache Apex project under the Apache Software Foundation. His current interests include big data and distributed systems, stream processing and machine learning. He has experience... Read More →


Wednesday November 16, 2016 11:00 - 11:50 CET
Santa Cruz

11:00 CET

Attacking a Big Data Developer - Olaf Flebbe, science+computing ag
Developers are a possible attack vector for targeted attacks to infiltrate malicious code into enterprises.

The speaker did a network traffic analysis with the Bro Network Security Monitor (bro.org), backed by an ELK Stack, while compiling Apache Bigtop, a Big Data distribution containing Apache Hadoop, Spark, HBase, Hive, Flink et al.

While there are no obvious traces of malicious code within the traffic, there are many findings of possible attack vectors, such as insecurely configured critical software infrastructure servers, usage of private repositories, or insecure protocols.

The analysis showed that many compile jobs download and run executables from untrusted sources. The speaker will briefly explain how these weaknesses can be exploited and give recommendations on how to resolve these issues.

Speakers
OF

Olaf Flebbe

Chief Software Architect
Dr. Olaf Flebbe received his PhD in computational physics in Tübingen, Germany. He works as the chief software architect at science+computing ag. He is a member of the PMC of Apache Bigtop. Occasionally he gives talks about random projects at various conferences.


Wednesday November 16, 2016 11:00 - 11:50 CET
Giralda I/II

11:00 CET

Shared Memory Layer and Faster SQL for Spark Applications - Dmitriy Setrakyan, GridGain
In this presentation we will talk about the need to share state in memory across different Spark jobs or applications, and about Apache Ignite as the technology that makes it possible. We will dive into the importance of in-memory file systems and shared in-memory RDDs with Apache Ignite, as well as the need to index data in memory for fast SQL execution. We will also present a hands-on demo showing the advantages and disadvantages of one approach over another. We will also discuss the requirements of storing data off-heap in order to achieve large horizontal and vertical scale for applications using Spark and Ignite.

Speakers
DS

Dmitriy Setrakyan

EVP Engineering, GridGain
Dmitriy Setrakyan is founder and Chief Product Officer at GridGain. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior... Read More →


Wednesday November 16, 2016 11:00 - 11:50 CET
Nervion/Arenal I

11:00 CET

SQL and Streaming Systems - Atri Sharma, Microsoft
The talk will focus on how to design and build systems for stream-based data, and on exploiting the power of SQL and relational algebra on streaming data using Apache Apex and Apache Calcite.

Speakers

Atri Sharma

SDE-II, Microsoft
A distributed systems engineer, committer on Apache Apex, PMC Member on Apache MADlib, PPMC Member on Apache HAWQ and major contributor to the PostgreSQL project, having implemented GROUPING SETS, ROLLUP, CUBE and Ordered Set Aggregates.


Wednesday November 16, 2016 11:00 - 11:50 CET
Giralda VI/VII

12:00 CET

Get in Control of Your Workflows with Apache Airflow - Christian Trebing, Blue Yonder
Whenever you work with data, sooner or later you stumble across the definition of your workflows. At what point should you process your customer's data? What subsequent steps are necessary? And what went wrong with your data processing last Saturday night?



At Blue Yonder we use Apache Airflow to solve these problems. It can be extended with new functionality by developing plugins in Python. With Airflow, we define workflows as directed acyclic graphs and get a shiny UI for free. Airflow comes with some task operators which can be used out of the box to complete certain tasks. For more specific cases, you can also develop new operators in your plugin.



This talk will explain the concepts behind Airflow, demonstrating how to define your own workflows and how to extend the functionality. You'll also get to hear about our experiences using this tool in real-world scenarios.

Speakers

Christian Trebing

Senior Software Engineer
Christian is a Software Developer from Karlsruhe, Germany. He studied Computer Science at TU Darmstadt. Currently he is working on big data applications at Blue Yonder, enjoying the challenges at the intersection of software engineering and data science.


Wednesday November 16, 2016 12:00 - 12:50 CET
Carmona

12:00 CET

Mining and Identifying Security Threat Using Spark SQL, HBase and Solr - Manidipa Mitra, ValueLabs
This presentation will discuss how to design a highly effective, scalable, and performant distributed system for detecting identity theft and fraud by mining billions of shareholding records for a leading financial organization. It will also cover how terabytes of data can be migrated from Oracle to Hadoop, stored in Parquet format, processed in a distributed computing framework with Spark DataFrames, and pushed to different service layers (HBase, Impala, Solr, HDFS) depending on the query and access pattern. The design also shows how frequent transactions were handled and how data was pre-processed at the end of the day to meet a response-time SLA of seconds, creating thousands of reports by mining millions of records within minutes.

Speakers

Manidipa Mitra

Director, ValueLabs
Manidipa Mitra heads the Big Data CoE at ValueLabs, with extensive experience in building industry-specific solutions using distributed computing and cloud technologies. Having 16+ years of software industry experience and in-depth knowledge of disruptive technologies, Cloud and...


Wednesday November 16, 2016 12:00 - 12:50 CET
Giralda III/IV

12:00 CET

Smart Storage Management: Towards Higher HDFS Storage Efficiency - Wei Zhou, Intel
Data volumes of all kinds have increased dramatically in recent years, and new storage devices (NVMe SSD, flash SSD, etc.) can be utilized to improve data access performance. HDFS provides methodologies like HDFS Cache, Heterogeneous Storage Management (HSM) and Erasure Coding (EC) to provide such support, but it remains a big challenge to define and adjust different storage strategies for different data in a dynamic environment.

To overcome this challenge and improve the storage efficiency of HDFS, we will introduce a comprehensive solution, Smart Storage Management (SSM), in Apache Hadoop. HDFS operation data and system state information are collected from the cluster. From the collected metrics, SSM can extract "data access patterns", and based on these patterns it automatically applies these methodologies to optimize HDFS storage efficiency.

Speakers

Wei Zhou

Software engineer at Intel, currently focused mainly on Apache Hadoop performance optimization. Co-speaker on the HBase Developer Course at Strata+Hadoop World Beijing 2016.


Wednesday November 16, 2016 12:00 - 12:50 CET
Giralda V

12:00 CET

Hands On! Deploying Apache Hadoop Spark Cluster with HA, Monitoring, and Logging in AWS - Andrew Mcleod & Peter Vander Giessen, Canonical
This is a hands-on, workshop-style session where attendees will learn how to deploy complex workloads such as a 10-node Hadoop Spark cluster complete with HA, logging, and monitoring. We can then scale the cluster from there depending on needs. Attendees will also learn how to deploy other workloads, such as connecting Apache Kafka or Apache Zeppelin into the solution, or trying the latest Cloud Native Kubernetes. We will then run sample TeraSort, Spark job, and PageRank benchmarks to get familiar with the cluster. An AWS controller will be provided for folks who don't have cloud access.
No prior knowledge is needed, but if you want to get a head start, install the Juju client by following the docs at http://jujucharms.com/get-started


Wednesday November 16, 2016 12:00 - 12:50 CET
Giralda I/II

12:00 CET

Apache Kudu: A Distributed, Columnar Data Store for Fast Analytics - Mike Percy, Cloudera
The Hadoop ecosystem has recently made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems like Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems like Apache HBase, applications can achieve millisecond-scale random access to arbitrarily-sized datasets. However, gaps remain when scans and random access are both required.



This talk will investigate the trade-offs between real-time random access and fast analytic performance from the perspective of storage engine internals. It will also describe Apache Kudu, the new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark, that fills the gap described above to provide a new option to achieve fast scans and fast random access from a single API.

Speakers

Mike Percy

Software Engineer, Cloudera
Mike Percy is a software engineer at Cloudera and a PMC member on Apache Kudu, an open source distributed column store for the Hadoop ecosystem. He is also a PMC member on Apache Flume. Prior to joining Cloudera, Mike worked at Yahoo! building machine learning infrastructure for Big...



Wednesday November 16, 2016 12:00 - 12:50 CET
Nervion/Arenal I

12:00 CET

How Software Engineering Has Changed with Advent of OSS - Nupur Sharma, Ingenium Data Systems
The talk will explore the business of open source and how open source has changed the way software engineering is done. Earlier, every software process was done as a ballpark project and designed with commercial, non-extensible products in mind. With the new open source paradigm, companies are now driving software development with open source products as the core and leveraging the extensibility of the product itself. In the talk, Nupur will walk through the thought process of product designers in the 1990s, the 2000s and now, and explain how organisations are adopting open source software and building their entire business models around it. Through some use cases, the transition from closed source to open source in many existing, well-thought-out processes will be discussed and explored. This should be enlightening for any organisation exploring a move to the OSS paradigm.

Speakers

Nupur Sharma

Director, Ingenium Data Systems
A serial entrepreneur, she founded GITC in 2005 and is currently co-founder and CEO of Ingenium Data Systems, a big data startup in India. She is one of India's original commercial software developers, having experience in developing products across a wide spectrum since 1989. She is currently...


Wednesday November 16, 2016 12:00 - 12:50 CET
Giralda VI/VII

12:00 CET

Performance Tuning Tips for Apache Spark Machine Learning Workloads - Shreeharsha GN & Amir Sanjar, IBM
The OpenPOWER 8 architecture, the latest offering of IBM SoftLayer, is the perfect platform for evaluating and optimizing Apache Spark solutions. In under 60 minutes from receiving a SoftLayer welcome package for your new bare-metal Power8 server, you can have Hadoop and Spark, along with many other software applications, installed, configured, optimized, and ready to run Spark ML workloads. In this talk we will cover: 1) Apache Spark overview 2) Apache Spark software deployment 3) Spark optimization on highly threaded servers 4) Demo

Speakers

Shreeharsha GN

Lead Engineer, AMD
Shreeharsha GN has many years of experience in performance engineering for software applications, big data software, and Java stack optimization, including IBM Java stack performance optimization, at companies including IBM, Azul Systems, HCL, and Infosys. He is a SPEC member and...


Wednesday November 16, 2016 12:00 - 12:50 CET
Santa Cruz

12:00 CET

Apache CouchDB 2.0 Sync Deep Dive - Jan Lehnardt, Neighbourhoodie Software
This talk takes a deep dive below the magic and explains how to build robust sync systems, whether you want to use CouchDB or build your own.

The talk will go through the components of a successful data sync system and the trade-offs you can make to solve your particular problems.

Reliable data sync, from Big Data to Mobile.

Speakers

Jan Lehnardt

CEO, Neighbourhoodie Software
Jan Lehnardt is the PMC Chair and VP of Apache CouchDB, co-creator of the Hoodie web app framework based on CouchDB as well as the founder and CEO of Neighbourhoodie Software. He’s the longest standing contributor to Apache CouchDB.


Wednesday November 16, 2016 12:00 - 13:00 CET
Nervion/Arenal II/III

13:00 CET

Highly Scalable Big Data Analytics with Apache Drill - Tom Barber, Meteorite Consulting
Big Data analytics is becoming more and more popular as the query response times improve. We'll look at building and deploying a fully operational and highly scalable Apache Bigtop based Big Data Analytics platform with no code.

In this talk we'll utilise the power of the open source Juju application modelling platform to deploy our software and configure it for us. We'll also discuss deployment options, scalability and resiliency, allowing users to get the most from the data.

Speakers

Tom Barber

Technical Director, Spicule LTD
Tom Barber is the director of Meteorite BI and Spicule BI. A member of the Apache Software Foundation and regular speaker at ApacheCon, Tom has a passion for simplifying technology. The creator of Saiku Analytics and open source stalwart, when not working for NASA, Tom currently deals...


Wednesday November 16, 2016 13:00 - 13:50 CET
Carmona

13:00 CET

Distributed Logistic Model Trees - Mateo Alvarez & Antonio Soriano, Stratio Big Data
Classification algorithms play an important role in different business areas, such as fraud detection, cross selling or customer behavior. In the business context, interpretability is a very desirable property, sometimes even a hard requirement. However, interpretable algorithms are usually outperformed by non-interpretable algorithms such as Random Forest. In this talk Antonio Soriano will present a distributed implementation in Spark of the Logistic Model Tree (LMT) algorithm (Landwehr, et al. (2005). Machine Learning, 59(1-2), 161-205.), which consists of a decision tree with logistic classifiers in the leaves. While being highly interpretable, the LMT consistently performs as well as or better than other popular algorithms on several performance metrics such as accuracy, precision/recall or area under the ROC curve.

Speakers

Mateo Alvarez

Big Data developer/ Data Scientist, Stratio
Mateo Álvarez studied aerospace engineering at the Universidad Politécnica de Madrid, with a master's degree in Propulsion Systems, and Data Science at the Universidad Rey Juan Carlos. He is passionate about data analysis with Scala, Python and all Big Data technologies, and is curren...


Wednesday November 16, 2016 13:00 - 13:50 CET
Giralda VI/VII

13:00 CET

Real Time Aggregation with Kafka, Spark Streaming and ElasticSearch, Scalable Beyond Million RPS - Dibyendu Bhattacharya, InstartLogic
While building a massively scalable real-time pipeline to collect transaction logs from network traffic, one of our major challenges was performing aggregation on streaming data on the fly. This was needed to compute multiple metrics across various dimensions, which help our customers see near-real-time views of application delivery and performance. In this talk, learn how we designed our real-time pipeline for multi-stage aggregation powered by Kafka, Spark Streaming and ElasticSearch. At InstartLogic we use a custom Spark Receiver for Kafka for the first-stage aggregation. The second stage is Spark Streaming-driven aggregation within a given batch window. The final stage uses custom ElasticSearch plugins to aggregate across batches. I will cover this multi-stage aggregation, including optimisations across all stages, which is scalable beyond a million RPS.

Speakers

Dibyendu Bhattacharya

Data Platform Engineer, InstartLogic
Dibyendu holds an MS in Software Systems and a B.Tech in Computer Science, with experience in building applications and products leveraging distributed computing and big data technologies. Presently working as Data Platform Engineer at InstartLogic, the world's first endpoint-aware application...



Wednesday November 16, 2016 13:00 - 13:50 CET
Giralda III/IV

13:00 CET

Scio, a Scala DSL for Apache Beam - Robert Gruener, Spotify
Learn about Scio, a Scala DSL for Apache Beam. Beam introduces a simple, unified programming model for both batch and streaming data processing, while Scio brings it much closer to the high-level API many data engineers are familiar with. We will cover the design and implementation of the framework, including features like type-safe BigQuery and a REPL. There will also be a live coding demo.

Speakers

Robert Gruener

Software Engineer, Spotify
I have been at Spotify for 3 years working on popular music recommendation features such as Discover Weekly and Release Radar. At Spotify I have been a large user of Scalding, Cassandra, and now Scio in order to make sense of our huge amount of data and find the perfect song to present...


Wednesday November 16, 2016 13:00 - 13:50 CET
Nervion/Arenal II/III

13:00 CET

Parquet Format in Practice & Detail - Uwe L. Korn, Blue Yonder
Apache Parquet is among the most commonly used column-oriented data formats in the big data processing space. It leverages various techniques to store data in a CPU- and I/O-efficient way. Furthermore, it can push down analytical queries on the data to the I/O layer to avoid loading irrelevant data chunks. With several Java implementations and a C++ implementation, Parquet is also the perfect choice for exchanging data between different technology stacks.

As part of this talk, a general introduction to the format and its techniques will be given. Their benefits and some of the inner workings will be explained to give a better understanding of how Parquet achieves its performance. At the end, benchmarks comparing the new C++ & Python implementation with other formats will be shown.

Speakers

Uwe L. Korn

Data Scientist, Blue Yonder GmbH
Uwe Korn is a Data Scientist at the German RetailTec company Blue Yonder. His expertise is on building architectures for machine learning services that are scalably usable for multiple customers aiming at high service availability as well as rapid prototyping of solutions to evaluate...


Wednesday November 16, 2016 13:00 - 13:50 CET
Nervion/Arenal I

13:00 CET

On the Representation and Reuse of Machine Learning Models - Villu Ruusmann, Openscoring Ltd.
Big Data applications rely on machine learning to derive new value. Model training and deployment are handled by different people in different environments, which makes model transferability a major concern.



This talk inquires into popular R, Scikit-Learn and Apache Spark model types, and connects them at a standardized PMML representation level. PMML adds value to all stages of the workflow, starting from model interpretation, reorganization and persistence, and ending with fully-automated model deployment to schema-full Big Data frameworks.



Attendees will learn that models are not locked-in "black boxes", but easily accessible and programmable components in the application layer. This realization should translate to improved workflows, and smarter and more performant applications.

Speakers

Villu Ruusmann

CTO, Openscoring OÜ
Villu Ruusmann is the founder and CTO of Openscoring Ltd, a company that provides an open source implementation of the Predictive Model Markup Language (PMML) standard. Villu has extensive knowledge about popular machine learning model training and deployment platforms, which he has turned...


Wednesday November 16, 2016 13:00 - 13:50 CET
Santa Cruz

13:00 CET

What's With the 1s and 0s? Making Sense of Binary Data at Scale with Tika and Friends - Nick Burch, Quanticate
Large amounts of unknown data seeks helpful tools to identify itself and generate content!

With one or two files, you can take time to manually identify them, and get out their contents. With thousands of files, or the internet's worth, this won't scale, even with mechanical turks! Luckily, there are open source tools and programs out there to help.

First we'll look at how we can work out what a given blob of 1s and 0s actually is, be it textual or binary. We'll then see how to extract common metadata from it, along with text, embedded resources, images, and maybe even the kitchen sink! We'll see how Apache Tika can do all of this for you, along with alternate and additional tools. Finally, we'll look at how to roll this all out on a Big Data scale.

Speakers

Nick Burch

CTO, Quanticate
Nick began contributing to Apache projects in 2003, and hasn't looked back since! He's mostly involved in "Content" projects like Apache POI, Apache Tika and Apache Chemistry, as well as foundation-wide activities like Conferences and Travel Assistance. Nick is CTO at Quanticate, a...


Wednesday November 16, 2016 13:00 - 13:50 CET
Giralda I/II

13:00 CET

Apache Ignite - JCache and Beyond - Dmitriy Setrakyan, GridGain
This presentation will provide an overview of the Apache Ignite project, including a detailed look into the distributed in-memory Data Grid, Compute Grid, Streaming, in-memory SQL, and many other components provided by Apache Ignite. We will also go into detail on how existing in-memory caching products and data grids can be used to share memory across Apache Spark jobs and applications. We will also present a hands-on demo demonstrating the performance benefits of querying shared memory using SQL.

Speakers

Dmitriy Setrakyan

EVP Engineering, GridGain
Dmitriy Setrakyan is founder and Chief Product Officer at GridGain. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior...


Wednesday November 16, 2016 13:00 - 13:50 CET
Giralda V
 