Apache: Big Data Europe 2016


Sunday, November 13
 

17:00

Pre-registration Open
Sunday November 13, 2016 17:00 - 19:00
Triana Foyer
 
Monday, November 14
 

07:00

Morning Run
Come meet in the lobby of the Melia Sevilla at 7:00am for a morning run. The plan is to cross to the park and run next to the river. 

This will last an hour and the group will be back by 8:00am.

Monday November 14, 2016 07:00 - 08:00
Melia Sevilla Hotel Lobby

08:30

Breakfast
Monday November 14, 2016 08:30 - 09:30
Giralda Foyer

08:30

Registration
Monday November 14, 2016 08:30 - 17:20
Triana Foyer

09:30

Keynote: Welcome & Opening Remarks - Rich Bowen, Vice President, Conferences, Apache Software Foundation
Speakers
avatar for Rich Bowen

Rich Bowen

Executive Vice President, Apache Software Foundation
Rich is a member, and the Executive Vice President, of the Apache Software Foundation. He has spoken at almost every ApacheCon. Rich works on the Apache HTTP Server project, and is the author of a few books about httpd. In his day job, he works in the Open Source and Standards group at Red Hat, where he does community things with the OpenStack project. He lives in Lexington, Kentucky.


Monday November 14, 2016 09:30 - 09:40
Giralda I/II

09:45

Keynote: Stream Processing as a Foundational Paradigm and Apache Flink's Approach to It - Stephan Ewen, CTO, Data Artisans
Stream Processing is emerging as a popular paradigm for data processing architectures, because it handles the continuous nature of most data and computation and gets rid of artificial boundaries and delays.

The fact that stream processing is gaining rapid adoption is also due to more powerful and maturing technology (much of it open source at the ASF) that has solved many of the hard technical challenges.

We discuss Apache Flink's approach to high performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also take a sneak preview at the next steps for Flink.
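The core ideas above, windows over event time plus watermarks to cope with out-of-order data, can be sketched in a few lines. This is a hand-rolled illustration of the concepts only, not the Apache Flink API; all names below are invented for the sketch.

```python
from collections import defaultdict

WINDOW = 10  # window length in event-time units

def tumbling_windows(events, max_lateness=5):
    """Assign (timestamp, value) events to tumbling windows and emit a
    window's sum once the watermark (max timestamp seen minus a lateness
    bound) has passed its end, even if events arrived out of order."""
    open_windows = defaultdict(int)   # window start -> running sum
    watermark = float("-inf")
    results = {}
    for ts, value in events:
        start = (ts // WINDOW) * WINDOW
        open_windows[start] += value
        watermark = max(watermark, ts - max_lateness)
        # fire every window whose end is at or before the watermark
        for s in sorted(list(open_windows)):
            if s + WINDOW <= watermark:
                results[s] = open_windows.pop(s)
    # end of stream: flush whatever is still open
    for s, total in open_windows.items():
        results[s] = total
    return results

# out-of-order input: the event at t=3 arrives after the one at t=12,
# yet still lands in the [0, 10) window
print(tumbling_windows([(1, 1), (12, 2), (3, 4), (14, 1), (27, 5)]))
```

A real engine additionally has to keep this state fault tolerant and consistent across machines, which is where much of Flink's engineering effort goes.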

Speakers
avatar for Stephan Ewen

Stephan Ewen

CTO, Data Artisans
Stephan Ewen is an Apache Flink PMC member and co-founder and CTO of data Artisans. Before founding data Artisans, Stephan led the development of Flink from the early days of the project. Stephan has a PhD in Computer Science from TU Berlin.


Monday November 14, 2016 09:45 - 10:05
Giralda I/II

10:10

Keynote: Training Our Team in the Apache Way - Alan Gates, Co-Founder, Hortonworks
Hortonworks contributes to a number of Apache projects. When we started, we depended on our many experienced Apache community members to train their fellow Hortonworkers in the Apache Way. But we grew quickly, and we found this started to break down. So we have instituted training for our teams in what Apache is, how it works, their responsibilities as part of Apache and how that meshes with their responsibilities as Hortonworkers, and a practical list of dos and don'ts. This talk will share some thoughts on the need for this training, give an overview of the content, and review some early results.

Speakers
avatar for Alan Gates

Alan Gates

Co-founder and Architect, Hortonworks
Alan is a founder of Hortonworks and an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan has done extensive work in Hive, including adding ACID transactions. Alan has a BS in Mathematics from Oregon State University and a MA in Theology from Fuller Theological Seminary. He is also the author of Programming Pig, a book from O'Reilly Press.


Monday November 14, 2016 10:10 - 10:25
Giralda I/II

10:25

Coffee Break
Monday November 14, 2016 10:25 - 11:00
Giralda Foyer

11:00

Demonstrating the Societal Value of Big & Smart Data Management - Simon Scerri, Fraunhofer IAIS
H2020 BigDataEurope is a flagship project of the European Union's Horizon 2020 framework programme for research and innovation. In this talk we present the Docker-based BigDataEurope platform, which integrates a variety of Big Data processing components such as Hive, Cassandra, Apache Flink and Spark. Particularly supporting the variety dimension of Big Data, it adds a semantic data processing layer, which makes it possible to ingest, map, transform and exploit semantically enriched data. We will present the innovative technical architecture as well as applications of the BigDataEurope platform for life sciences (OpenPhacts), mobility, food & agriculture, as well as industrial analytics (predictive maintenance). We demonstrate how societal value can be generated by Big Data analytics, e.g. making transportation networks more efficient or facilitating drug research.

Speakers
SS

Simon Scerri

BDE Deputy Coordinator, Fraunhofer IAIS
Simon Scerri is a senior postdoc in the “Enterprise Information Systems” department at Fraunhofer IAIS and at the University of Bonn. In 2011, Simon received his Ph.D. from the Faculty of Engineering at the National University of Ireland, Galway. Prior to joining Fraunhofer, Simon contributed to research efforts (2005–2013) at the Digital Enterprise Research Institute (DERI). In 2014–2015 Simon was an ERCIM fellow... Read More →


Monday November 14, 2016 11:00 - 11:50
Santa Cruz

11:00

The Role of Apache Big Data Stack in Finance: A Real World Experience on Providing Added Value to Online Customers - Luca Rosellini, KEEDIO
Nowadays, the main burden of Big Data adoption is clearly the integration of new infrastructure and technologies with legacy systems, especially when dealing with data ingestion.

In this talk, KEEDIO will present the details of a Big Data architecture deployed in a hybrid infrastructure for a rising bank in Spain, in order to provide added value to its customers. This success story has been made possible by means of custom analytics built on top of several components of the Apache stack.

The main and most interesting issues of this deployment will be explained, as well as their solutions based on tools like Apache NiFi, Apache Spark, Apache Mesos and Apache Zeppelin. Thus, the complete ingestion architecture will be outlined, as well as data consolidation and processing.

Finally, the low-latency online data exploitation architecture will be explained.

Speakers
LR

Luca Rosellini

CTO, KEEDIO
Luca has been working on Big Data projects for major Spanish corporations for the last four years. He now serves as the CTO of KEEDIO, a young Spanish startup focused on solving Big Data problems in banking environments. He holds a master's degree in computer engineering from the University of Pisa, Italy. Previous talks: Spark Summit 2013, Big Data Spain 2013.


Monday November 14, 2016 11:00 - 11:50
Giralda III/IV

11:00

Geospatial Track: Apache SIS for Earth Observation and Beyond - Martin Desruisseaux, Geomatys
Apache SIS is a library that helps developers create their own geospatial applications. SIS closely follows the international standards published jointly by the Open Geospatial Consortium (OGC) and the International Organization for Standardization (ISO). In this talk we will show how SIS provides a unified metadata model, based on the ISO 19115 standard, for summarizing the content of file formats used for Earth observation: GeoTIFF, NetCDF, Landsat 8 and MODIS. We will show how to get the Coordinate Reference System (CRS) from those file formats or from other sources like Well Known Text (WKT) 2 or registries maintained by authorities, and how to use those CRS for coordinate operations. We will present new issues to take into account when applying those tools to extra-terrestrial bodies like Mars or asteroids. Finally we will present the next developments proposed for Apache SIS.

Speakers
MD

Martin Desruisseaux

Developer, Geomatys
I hold a Ph.D. in oceanography, but have continuously developed tools to support analysis work. I used C/C++ before switching to Java in 1997. I have developed geospatial libraries since that time, initially as a personal project, then as a GeoTools contributor until 2008. I have been contributing to Apache SIS since 2013. I attend Open Geospatial Consortium (OGC) meetings about twice per year in order to closely follow standard developments and... Read More →


Monday November 14, 2016 11:00 - 11:50
Carmona

11:00

Practical Graph Analytics with Apache Giraph - Roman Shaposhnik, Pivotal
This talk will help you build data mining and machine learning applications using the Apache Giraph framework for graph processing. The talk is based on the book "Practical Graph Analytics with Apache Giraph" and aims to be as hands-on as possible. Apache Giraph offers a simple yet flexible programming model targeted at graph algorithms and designed to scale easily to accommodate massive amounts of data. Originally developed at Yahoo!, Giraph now enjoys a diverse community of contributors from a who's who of Silicon Valley companies: Facebook, LinkedIn and Twitter.
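The vertex-centric ("think like a vertex") model Giraph implements can be illustrated with a tiny Bulk Synchronous Parallel loop. This is a pure-Python sketch of the Pregel idea, not the Giraph API (which is Java and distributed); the names and structure here are invented for illustration.

```python
def shortest_paths(graph, source):
    """Single-source shortest paths, Pregel-style.
    graph: vertex -> [(neighbor, edge_weight)]; returns distances."""
    INF = float("inf")
    dist = {v: INF for v in graph}
    inbox = {v: [] for v in graph}
    inbox[source] = [0]                        # seed the source vertex
    while any(inbox.values()):                 # one superstep per loop
        outbox = {v: [] for v in graph}
        for v, msgs in inbox.items():
            if msgs and min(msgs) < dist[v]:   # the vertex program
                dist[v] = min(msgs)
                for n, w in graph[v]:          # message the neighbors
                    outbox[n].append(dist[v] + w)
        inbox = outbox                         # barrier between supersteps
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(shortest_paths(g, "a"))  # {'a': 0, 'b': 1, 'c': 3}
```

In Giraph the same logic lives in a vertex `compute()` method, and the framework handles partitioning, messaging and the superstep barrier across the cluster.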

Speakers
avatar for Roman Shaposhnik

Roman Shaposhnik

Director of Open Source, Pivotal Inc.
Roman Shaposhnik is a Director of Open Source at Pivotal Inc. He is a committer on Apache Hadoop, co-creator of Apache Bigtop and contributor to various other Hadoop ecosystem projects. He is also an ASF member and a former Chair of Apache Incubator. In his copious free time he managed to co-author "Practical Graph Analytics with Apache Giraph" and he also posts to twitter as @rhatr. Roman has been involved in Open Source software for more than a... Read More →


Monday November 14, 2016 11:00 - 11:50
Giralda VI/VII

11:00

Hive 2.0 SQL, Speed, Scale - Alan Gates, Hortonworks
Apache Hive is the most commonly used SQL interface for Hadoop. To meet users' data warehousing needs it must scale to petabytes of data, provide the necessary SQL, and perform in interactive time. The Hive community has produced a 2.0 release of Hive that includes significant improvements. These include:

* LLAP, a daemon layer that enables sub-second response time

* HBase to store Hive's metadata, resulting in significantly reduced planning time

* Using Apache Calcite to build a cost-based optimizer

* Adding procedural SQL

* Improvements in using Spark as an engine for Hive execution

This talk will cover the use cases these changes enable, the architectural changes being made in Hive as part of building these features, and performance test results showing how these improvements speed up Hive.

Speakers
avatar for Alan Gates

Alan Gates

Co-founder and Architect, Hortonworks
Alan is a founder of Hortonworks and an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan has done extensive work in Hive, including adding ACID transactions. Alan has a BS in Mathematics from Oregon State University and a MA in Theology from Fuller Theological Seminary. He is also the author of Programming Pig, a book from O'Reilly Press.


Monday November 14, 2016 11:00 - 11:50
Nervion/Arenal II/III

11:00

Putting A Spark in Web Apps - David Fallside, IBM
Web app developers want to take advantage of the sophisticated analytics and big data processing provided by engines such as Apache Spark. Traditionally, web app development would use an enterprise language like Java but with a development emphasis now on agility and simplicity, technologies such as Node.js, Ruby on Rails, and PHP are increasingly being used for such development. Apache Spark has APIs for Scala, Java and Python but no API for Node.js or JavaScript despite their importance for web app development. To fill this gap, the EclairJS open-source project was created to provide an API in Node.js and JavaScript and enable web app developers to incorporate the analytic and other capabilities of Spark. In this presentation, David Fallside will show some web applications that demonstrate Spark's capabilities and explain how they are implemented using EclairJS.

Speakers
avatar for David Fallside

David Fallside

Technologist, IBM
David Fallside works in an emerging tech team at IBM that develops the open-source EclairJS project and provides Node.js with an Apache Spark API. Some of the team’s previous projects include LoB tools for Spark and Hadoop, and information engineering for IBM’s Watson. David was responsible for creating the Apache Derby project, and has worked in several W3C working groups including XML Schema, XML SOAP, and the W3C Advisory... Read More →



Monday November 14, 2016 11:00 - 11:50
Giralda I/II

11:00

Apache Gearpump Next-Gen Streaming Engine - Karol Brejna & Huafeng Wang, Intel
Stream processing has gone mainstream in the Big Data world and is becoming widely adopted in the industry. Despite its expanding popularity, many hard problems remain to be solved. Apache Gearpump (incubating) is a next-gen streaming engine designed to solve the hard parts of stream processing. It is good at streaming infinite out-of-order data and guarantees correctness. It helps users easily program streaming applications, inspect runtime information and update applications dynamically. In this presentation, we will demystify how Gearpump solves the hard parts of stream processing and achieves high-throughput message delivery at millisecond latency.

Speakers
avatar for Karol Brejna

Karol Brejna

Intel
Father, husband, software enthusiast. After over a dozen years of struggling with system integration, service- and event-oriented/driven architectures, business process management, enterprise content management, NoSQL databases, ESBs and clouds, he joined Intel to work for the Analytics and Artificial Intelligence Solutions Group. Contributor to the Trusted Analytics Platform and Apache Gearpump (incubating) open source projects.
avatar for Huafeng Wang

Huafeng Wang

Software engineer, Intel
Huafeng is a software engineer in Intel's Big Data engineering group, as well as a committer on Apache Gearpump, an open-source stream processing engine initiated by Intel.



Monday November 14, 2016 11:00 - 11:50
Nervion/Arenal I

12:00

Data Science with Spark and Case Study with Non-Motorized Travel Social Data for the Public - Yi Fan Zhang, IBM
The collection, documentation, management and analysis of big data associated with non-motorized travel has not attracted enough attention. This does not match the trend that cycling, walking and jogging are strongly advocated by governments to build low-carbon cities and to improve people's health. This session will share experience in quantifying and characterizing non-motorized travel by means of spatio-temporal analysis. The data used in this case is captured from a well-known online community where running amateurs share their activities. Around 0.5 million running and cycling records from 0.3 million people in Beijing are analyzed with machine learning and data science methodology in this case study. Spark ML with the random forest algorithm, together with grid search for parameter selection, has been used for prediction based on weather, AQI and time.

Speakers
avatar for Yi Fan Zhang

Yi Fan Zhang

Software Engineer, IBM
Working in Cloud Data Services, Big Data, and Entity Analytics development at the IBM China Development Lab. Recently, I have been working on Smart Traffic with a People/Vehicle Trajectory Analysis Platform: building a Spark distributed computing environment and designing and developing Spark applications to verify the feasibility and stability of a Big Data traffic information processing platform using mobile signaling. Through the IBM Big Data distributed... Read More →


Monday November 14, 2016 12:00 - 12:50
Giralda III/IV

12:00

Geospatial Track: Geospatial Big Data: Software Architectures and the Role of APIs in Standardized Environments - Ingo Simonis, Open Geospatial Consortium (OGC)
A number of technologies have evolved around big data, in particular products from the Apache community such as Hadoop, Storm, Spark, Hive, or Cassandra. The geospatial community has developed a range of standards to handle geospatial data in an efficient way. Most of these standards are produced by the Open Geospatial Consortium (OGC) and implemented in the form of domain-agnostic data models and Web services. With the emerging demand for streamlined APIs, new questions arise about how access to Big Data in the geospatial community can be handled most efficiently, and how existing standards serve these new demands and the implementation realities of distributed Big Data repositories operated, for example, by the various space agencies. This presentation aims to stimulate discussion of geospatial Big Data handling in standardized environments and explore the role of products from the Apache community.

Speakers
IS

Ingo Simonis

Director Interoperability Programs & Science, OGC
Dr. Ingo Simonis is director of interoperability programs and science at the Open Geospatial Consortium (OGC), an international consortium of more than 525 companies, government agencies, research organizations, and universities participating in a consensus process to develop publicly available geospatial standards. As lead architect of OGC’s prototyping and exploration program, he supervises the technical experiments and enhancements of... Read More →


Monday November 14, 2016 12:00 - 12:50
Carmona

12:00

Graph Processing with Apache Tinkerpop on Apache S2Graph - Doyung Yoon, Kakao Corp.
Since the last conference, the Apache S2Graph community has been working on integration with Apache TinkerPop. TinkerPop users are now able to use S2Graph as a graph database without changing their TinkerPop code, and also to execute OLAP graph queries over their data in HDFS. We will share our experiences integrating TinkerPop as a graph database API, and comment on our current limitations and future plans. We will also present benchmark results comparing S2Graph with existing graph databases such as Neo4j, Titan, and OrientDB. We focus our benchmarks on "neighbors of neighbors" queries and basic CRUD operations. Similar to Titan, S2Graph supports multiple storage backends, such as HBase, Cassandra, MySQL, PostgreSQL, and RocksDB, and S2Graph's performance on each backend will be presented.
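The "neighbors of neighbors" query used in the benchmarks can be sketched over a plain adjacency map. In Gremlin, the Apache TinkerPop traversal language, this is roughly `g.V(id).out().out()`; the stand-alone code below is an illustration of the query shape, not the S2Graph or TinkerPop API.

```python
def neighbors_of_neighbors(adj, vertex):
    """Two-hop traversal over adj: vertex -> list of out-neighbors.
    Returns vertices reachable in exactly two hops, excluding the
    start vertex and its direct neighbors."""
    first_hop = set(adj.get(vertex, ()))
    second_hop = set()
    for v in first_hop:
        second_hop.update(adj.get(v, ()))
    return second_hop - first_hop - {vertex}

adj = {"u1": ["u2", "u3"], "u2": ["u4"], "u3": ["u4", "u5"], "u5": ["u1"]}
print(neighbors_of_neighbors(adj, "u1"))
```

At scale the interesting part is exactly what the benchmarks measure: how fast a storage backend can fan out over the (potentially huge) first-hop neighbor sets.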

Speakers
avatar for Doyung Yoon

Doyung Yoon

Software Engineer, Kakao
Doyung works in a distributed graph database team at Kakao as software engineer, where his focus is on performance and usability. He developed Apache S2Graph, an open-source distributed graph database, and has previously presented it at ApacheCon BigData Europe and ApacheCon BigData North America.



Monday November 14, 2016 12:00 - 12:50
Giralda VI/VII

12:00

An Overview on Optimization in Apache Hive: Past, Present, Future - Jesús Camacho Rodríguez, Hortonworks
Apache Hive has been continuously evolving to support a broad range of use cases, bringing it beyond its batch processing roots to its current support for interactive queries with LLAP. However, the development of its execution internals is not sufficient to guarantee efficient performance, since poorly optimized queries can create a bottleneck in the system. Hence, each release of Hive has included new features for its optimizer aimed to generate better plans and deliver improvements to query execution. In this talk, we present the development of the optimizer since its initial release. We describe its current state and how Hive leverages the latest Apache Calcite features to generate the most efficient execution plans. We show numbers demonstrating the improvements brought to Hive performance, and we discuss future directions for the next-generation Hive optimizer.

Speakers
avatar for Jesús Camacho Rodríguez

Jesús Camacho Rodríguez

Member of Technical Staff, Hortonworks
Jesús Camacho Rodríguez is a Member of Technical Staff at Hortonworks, the PMC chair of Apache Calcite, and a PMC member of Apache Hive. His current work focuses on extending and improving query processing and optimization, ensuring that the increasingly complex workloads supported by Hive are executed quickly and efficiently. Prior to that, Jesús obtained his PhD in Computer Science from Paris-Sud University and Inria, working on... Read More →



Monday November 14, 2016 12:00 - 12:50
Nervion/Arenal II/III

12:00

Machine Learning in Apache Zeppelin - Alexander Bezzubov, NF Labs
There are many Machine Learning projects available: Apache Mahout, Apache SystemML (incubating), Apache Spark MLlib, Tensorflow, Scikit-learn, etc.

In this session we are going to showcase how a typical ML predictive analytics workflow can benefit from a modern notebook-style interactive environment like Apache Zeppelin. We are going to share examples of successful integration between different projects in the big data ecosystem and touch on various techniques like visual recognition, NLP, and deep learning.


Speakers
AB

Alexander Bezzubov

Software Engineer, NFLabs
Alexander Bezzubov is an Apache Zeppelin contributor, PMC member and software engineer at NFLabs. Previous speaking experience includes Apache BigData NA 2016 in Vancouver, FOSSASIA 2016 in Singapore, and Apache BigData EU 2015 in Budapest.


Monday November 14, 2016 12:00 - 12:50
Santa Cruz

12:00

Managing Deeply Nested Documents in Apache Solr - Anshum Gupta, IBM Watson
Apache Solr recently started supporting deeply nested documents. Solr can now be used to perform search and faceting on documents such as nested email threads, comments and replies on social media, and enriched and annotated documents, without having to flatten them before ingestion.

Anshum Gupta will discuss pre-processing data so that it can be indexed in Solr, making it possible to perform complex search and statistical aggregation on top of it. He will also cover query formation for sample use cases of nested data, and the multiple options and features that Solr provides for faceting and aggregation of such documents.

By the end of this talk, Solr users will have a better understanding both of how to work with the features Solr provides to answer interesting questions from deeply nested documents, and of workarounds for the missing pieces.
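For context, a nested document of the kind described, say a post with comments and replies, can be sent to Solr's JSON update format without flattening by nesting children under the `_childDocuments_` key. The field names and IDs below are illustrative, not taken from the talk:

```json
{
  "id": "thread-1",
  "type_s": "post",
  "_childDocuments_": [
    {
      "id": "c1",
      "type_s": "comment",
      "_childDocuments_": [
        { "id": "c1-r1", "type_s": "reply" }
      ]
    }
  ]
}
```

Block-join query parsers can then relate levels, e.g. a query along the lines of `{!parent which="type_s:post"}type_s:reply` returns posts that have at least one matching reply descendant.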

Speakers
avatar for Anshum Gupta

Anshum Gupta

Sr. Software Engineer, IBM Watson
Anshum Gupta is a Lucene/Solr committer and PMC member with over 10 years of experience with search. He is a part of the search team at IBM Watson, where he works on extending the limits and improving SolrCloud. Prior to this, he was a part of the open source team at Lucidworks and also the co-creator of AWS CloudSearch - the first search as a service offering by AWS. He has spoken at multiple international conferences, including Apache Big Data... Read More →


Monday November 14, 2016 12:00 - 12:50
Giralda V

12:00

Building a Scalable Recommendation Engine with Apache Spark, Apache Kafka and Elasticsearch - Nick Pentreath, IBM
There are many resources available for using Apache Spark to build collaborative filtering models. However, there are relatively few for how to build a large-scale, end-to-end recommender system.



This talk will show how to create such a system, using Apache Kafka, Spark Streaming and Elasticsearch for data ingestion, real-time analytics and data storage, Spark DataFrames and ML pipelines for data aggregation and model building, and Elasticsearch for model management, serving and data visualization. We will also explore techniques for scaling model serving, using Spark Streaming for real-time model updates, and how to incorporate state-of-the-art models into this framework.



The talk will be technical and developer-focused, highlighting experiences from building real-world recommender systems, and providing example code (which will be available as open source).
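As a rough illustration of the model-building stage, here is a toy item-based collaborative filter in plain Python. The real pipeline described in the talk uses Spark DataFrames, ML pipelines, Kafka and Elasticsearch; everything below (the data, field names and scoring) is a simplified stand-in.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(ratings, user, top_n=2):
    """ratings: item -> {user: rating}. Score each unseen item by its
    summed similarity to the items the user has already rated."""
    seen = [i for i, r in ratings.items() if user in r]
    scores = {}
    for item, r in ratings.items():
        if user in r:
            continue
        scores[item] = sum(cosine(r, ratings[s]) for s in seen)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

ratings = {
    "film_a": {"u1": 5, "u2": 4},
    "film_b": {"u1": 4, "u2": 5, "u3": 5},
    "film_c": {"u2": 4, "u3": 4},
}
print(recommend(ratings, "u3"))  # ['film_a']
```

In the end-to-end system described above, the similarity (or factor) vectors would be precomputed in Spark and stored in Elasticsearch, so that serving a recommendation becomes a fast query rather than a batch computation.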

Speakers
avatar for Nick Pentreath

Nick Pentreath

Principal Engineer, IBM
Nick is a Principal Engineer at IBM, working primarily on machine learning on Apache Spark. He is a member of the Apache Spark PMC and author of Machine Learning with Spark. Previously, he co-founded Graphflow, a machine learning startup focused on recommendations. He has also worked at Goldman Sachs, Cognitive Match and Mxit. He is passionate about combining commercial focus with machine learning and cutting-edge technology to build... Read More →


Monday November 14, 2016 12:00 - 12:50
Giralda I/II

12:00

Property-based Testing for Spark Streaming - Adrian Riesco, Universidad Complutense de Madrid
Spark Streaming is currently one of the leading frameworks in the industry for distributed stream processing. However, testing Spark Streaming programs is still a challenge, due to the complications of dealing with time. In this presentation, Adrian Riesco gives an introduction to sscheck, a testing library for Spark that extends ScalaCheck with additional temporal logic operators for generators and properties. These are used to define tests for Spark Streaming as linear temporal logic formulas, resulting in tests that are high level and easy to understand.
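The idea behind property-based testing, in miniature: instead of fixed example inputs, generate many random cases and assert a property over all of them. sscheck extends this idea (via ScalaCheck) with temporal logic over stream batches; this stdlib-only sketch shows just the base technique, with names invented for illustration.

```python
import random

def check_property(prop, gen, runs=200):
    """Run `prop` against `runs` generated inputs; return a failing
    counterexample, or None if the property held every time."""
    for _ in range(runs):
        case = gen()
        if not prop(case):
            return case
    return None

def gen_ints():
    """Generator: a random-length list of small random integers."""
    return [random.randint(-50, 50) for _ in range(random.randint(0, 20))]

# a property that holds: reversing a list twice yields the original
assert check_property(lambda xs: list(reversed(list(reversed(xs)))) == xs, gen_ints) is None

# a deliberately false property: random lists are almost never sorted,
# so a counterexample turns up within a few runs
assert check_property(lambda xs: xs == sorted(xs), gen_ints) is not None
```

What sscheck adds on top is the time dimension: generators produce sequences of batches, and properties are temporal formulas ("eventually", "always") evaluated as the stream progresses.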

Speakers
avatar for Adrian Riesco

Adrian Riesco

PhD Assistant Professor, Facultad de Informatica (UCM)
I currently work as PhD Assistant Professor at Universidad Complutense de Madrid, Spain. I am also a member of the research group FADOSS, and my research interests include formal methods, logic, debugging, and testing.



Monday November 14, 2016 12:00 - 12:50
Nervion/Arenal I

13:00

Uber - Your Realtime Data Pipeline is Arriving Now! - Ankur Bansal, Uber
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.



Real-time infrastructure powers critical pieces of Uber (think Surge). In this talk we will discuss our architecture, technical challenges and learnings, and how a blend of open-source infrastructure (Apache Kafka and Samza) and in-house technologies has helped Uber scale.


Speakers
avatar for Ankur Bansal

Ankur Bansal

Sr. Software Engineer, Uber
Ankur Bansal is a senior engineer on Uber's streaming team. He is currently focused on building Kafka infrastructure and scaling it to keep up with Uber's hypergrowth. His areas of interest include distributed systems and cloud computing. Before Uber he worked at eBay, where he was part of the founding team that built eBay's OpenStack-based cloud and scaled it to thousands of nodes. He is also a committer and PMC member of Apache Kylin.



Monday November 14, 2016 13:00 - 13:50
Giralda III/IV

13:00

Geospatial Track: Crowd Learning for Indoor Navigation - Thomas Burgess, indoo.rs GmbH
indoo.rs enables location-based services for indoor applications. With indoo.rs, developers can add new features to their products, including having locations trigger events, tracking assets, and showing the closest routes to other places. For this, we use WiFi/beacon radio infrastructure, mobile devices and our cloud, which together produce large amounts of geospatial time-series data. The real-time indoor navigation fuses independent movement estimates from custom 9D sensor fusion with position estimates obtained by comparing current signal readings to a reference map. This talk will discuss how we create and maintain these maps in our big data machine learning system, which leverages crowd data through Kafka and Spark to run SLAM and context-aware algorithms to create high-quality trajectories. In addition to their use in reference maps, these trajectories provide an additional input for our interactive analytics.

Speakers
avatar for Thomas Burgess

Thomas Burgess

Director of research, indoo.rs GmbH
Thomas is the CRO of indoo.rs and has led its research efforts since 2012. Earlier, he did his PhD in particle physics at Stockholm University for the AMANDA/IceCube neutrino telescopes, and worked as a postdoctoral researcher at the University of Bergen for the ATLAS experiment at the LHC. At indoo.rs he does modeling, statistics, data science, machine learning, and new algorithms for mobile indoor navigation. He has given numerous public talks for... Read More →


Monday November 14, 2016 13:00 - 13:50
Carmona

13:00

Apache S2Graph (incubating) as a User Event Hub - Hyunsung Jo, Daewon Jeong & Hwansung Yu, Kakao Corp.
S2Graph is a graph database designed to handle transactional graph processing at scale.

Its API allows you to store, manage and query relational information using edge and vertex representations in a fully asynchronous and non-blocking manner.

However, at Kakao Corp., where the project was originally started, we believe that it could be so much more.

There have been efforts to utilize S2Graph as the centerpiece of Kakao's event delivery system, taking advantage of its strengths such as

- flexibility of seamless bulk loading, AB testing, and stored procedure features,

- multitenancy that allows interoperability among different services within the company,

- and most of all, the ability to run various operations ranging from basic CRUD to multi-step graph traversal queries in realtime with large volumes.

Speakers
avatar for Daewon Jeong

Daewon Jeong

Programmer, kakao
Works on S2Graph team
avatar for Hyunsung Jo

Hyunsung Jo

Kakao
Seoul-based developer interested in large-scale data systems and cloud computing. Currently working as a data systems developer at Kakao Corp., Korea, with open source projects such as Apache S2Graph (incubating) and Druid, among others. Previous work experience includes software development at Siemens AG, Germany and Samsung Electronics, Korea. Presented a Lightning Talk at Apache: Big Data North America 2016, Vancouver, BC.
HY

Hwansung Yu

Kakao
Developer interested in LBS services and large-scale data systems.



Monday November 14, 2016 13:00 - 13:50
Giralda VI/VII

13:00

Hadoop, Hive, Spark and Object Stores - Steve Loughran, Hortonworks
Cloud deployments of Apache Hadoop are becoming more commonplace. Yet Hadoop and its applications don't integrate that well with cloud infrastructure, something which starts right down at the file IO operations.



This talk looks at how to make use of cloud object stores in Hadoop applications, including Hive and Spark. It will go from the foundational "what's an object store?" to the practical "what should I avoid?" and the timely "what's new in Hadoop?", the latter covering the improved S3 support in Hadoop 2.8+.



I'll explore the details of benchmarking and improving object store IO in Hive and Spark, showing what developers can do in order to gain performance improvements in their own code, and equally, what they must avoid.



Finally, I'll look at ongoing work, especially "S3Guard" and what its fast and consistent file metadata operations promise.
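For example, applications address the store through `s3a://bucket/path` URLs, and the S3A connector is tuned through properties in `core-site.xml`. The property names below exist in Hadoop's S3A connector; the values are placeholders, and defaults and available options vary by Hadoop version:

```xml
<!-- core-site.xml: illustrative S3A settings, values are placeholders -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.eu-west-1.amazonaws.com</value>
</property>
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>100</value>
</property>
<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>
```

Credentials can be supplied the same way (or, preferably, through IAM roles or environment-based providers), and much of the tuning the talk covers comes down to settings like these plus how the application issues its reads and writes.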

Speakers
avatar for Steve Loughran

Steve Loughran

Member of Technical Staff, Hortonworks
Steve Loughran is a developer at Hortonworks, where he works on leading-edge Hadoop applications, most recently on Apache Slider and on Apache Spark's integration with Hadoop and YARN, and Hadoop's S3A connector to Amazon S3. He's the author of Ant in Action, a member of the Apache Software Foundation, and a committer on the Hadoop core since 2009. He lives and works in Bristol, England.


Monday November 14, 2016 13:00 - 13:50
Nervion/Arenal II/III

13:00

Introducing Apache Apex: Next Gen Big Data Processing on Hadoop - Thomas Weise, DataTorrent
Stream data processing is becoming increasingly important to support business needs for faster time to insight and action with growing volumes of information from more sources. Apache Apex (http://apex.apache.org/) is a unified platform for processing big data in motion in the Apache Hadoop ecosystem. Apex supports demanding use cases with:



* An architecture for high throughput, low latency and exactly-once processing semantics

* A rich library of building blocks, including connectors for Kafka, files, Cassandra, HBase and many more

* A Java-based, unobtrusive API to build real-time and batch applications and implement custom business logic

* Advanced engine features for auto-scaling, dynamic changes and compute locality



Apex has been under development since 2012 and is used in production in various industries such as online advertising, the Internet of Things (IoT) and financial services.
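The exactly-once semantics mentioned above boil down to replay plus deduplication against checkpointed operator state. A hypothetical, stdlib-only sketch of that idea (this is not Apex's actual operator API):

```python
class ExactlyOnceCounter:
    """Replayed input is ignored because the operator checkpoints
    the last offset it has already applied to its state."""

    def __init__(self):
        self.last_offset = -1   # checkpointed progress marker
        self.total = 0          # operator state

    def process(self, offset, value):
        if offset <= self.last_offset:
            return              # duplicate from replay: skip
        self.total += value
        self.last_offset = offset

op = ExactlyOnceCounter()
events = [(0, 5), (1, 7), (2, 3)]
for off, val in events:
    op.process(off, val)
# a failure causes the source to replay offsets 1 and 2
for off, val in events[1:]:
    op.process(off, val)
print(op.total)   # 15, not 25: the replays were deduplicated
```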

Speakers
avatar for Thomas Weise

Thomas Weise

Architect, DataTorrent
Thomas is Apache Apex PMC chair and architect/co-founder at DataTorrent. He has developed distributed systems, middleware and web applications since 1997. Prior to DataTorrent he was in the Hadoop Team at Yahoo! and contributed to several of the ecosystem projects.


Monday November 14, 2016 13:00 - 13:50
Nervion/Arenal I

13:00

Distributed In-Database Machine Learning with Apache MADlib (incubating) - Roman Shaposhnik, Pivotal
Data science is moving with gusto to the enterprise, where data often resides in relational databases with SQL as the main workload. So how can an enterprise add a data science dimension to their business without a major IT re-architecture?

Apache MADlib (incubating) is an innovative SQL-based open source library for scalable in-database analytics. It provides parallel implementations of mathematical, statistical and machine learning methods. Bringing machine learning computations to the data makes for excellent scale out performance on massively parallel processing (MPP) platforms like Greenplum database and Apache HAWQ (incubating).

In this talk, we will describe the origin of MADlib, review the architecture and common usage patterns, and look ahead to some interesting plans around performance acceleration.


Speakers
avatar for Roman Shaposhnik

Roman Shaposhnik

Director of Open Source, Pivotal Inc.
Roman Shaposhnik is a Director of Open Source at Pivotal Inc. He is a committer on Apache Hadoop, co-creator of Apache Bigtop and contributor to various other Hadoop ecosystem projects. He is also an ASF member and a former Chair of Apache Incubator. In his copious free time he managed to co-author "Practical Graph Analytics with Apache Giraph" and he also posts to twitter as @rhatr. Roman has been involved in Open Source software for more than a... Read More →


Monday November 14, 2016 13:00 - 13:50
Santa Cruz

13:00

Fast & Scalable Email System with Apache Solr - Strategies, Tradeoffs and Optimizations - Arnon Yogev, IBM Research
Email interaction has its own unique characteristics and differs from traditional web search (for example, users search their own private mailboxes and are often interested in recent emails rather than the archive).

Taking advantage of these characteristics, we were able to optimize our infrastructure in terms of indexing strategy and query optimization and achieve a significant gain in scalability and performance.

Arnon will present the various tradeoffs that were explored, including multi-tiered indexes, sorted indexes, query optimizations and more.

Arnon will then present the benchmark results that stress the importance of correctly designing a Solr infrastructure and tailoring it to one's specific use case.
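One of the tradeoffs mentioned, a multi-tiered index, can be illustrated with a toy router that searches a small "recent" tier before falling back to the large archive. This is a hypothetical sketch of the idea, not IBM's implementation:

```python
# Most mailbox queries hit recent mail, so the small "recent" tier is
# searched first; the archive tier is consulted only when too few hits
# come back. Documents here are simplified to a subject field.
def search(query, recent_tier, archive_tier, wanted=10):
    hits = [doc for doc in recent_tier if query in doc["subject"]]
    if len(hits) < wanted:                 # fall through to the archive
        hits += [doc for doc in archive_tier if query in doc["subject"]]
    return hits[:wanted]

recent = [{"subject": "lunch plans"}, {"subject": "quarterly report"}]
archive = [{"subject": "quarterly report 2014"}] * 20

print(len(search("quarterly", recent, archive)))  # 10
```

The win is that queries satisfied by the recent tier never touch the (much larger, slower) archive index at all.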

Speakers
avatar for Arnon Yogev

Arnon Yogev

Software Developer & Researcher, IBM Research
Arnon is a software engineer in IBM Research, part of the Social Analytics & Technologies team, Big Data and Cognitive Analytics department. Arnon earned his MBA degree and his B.Sc in Computer Science from the Technion. Being part of the Social Analytics & Technologies team, Arnon's work experience includes social networks, search-based systems over Lucene and Solr and graph technologies. Previous lecturing experience includes Connected 2015... Read More →


Monday November 14, 2016 13:00 - 13:50
Giralda V

13:00

Building Apache Spark Application Pipelines for the Kubernetes Ecosystem - Michael McCune, Red Hat
Apache Spark based applications are often comprised of many separate, interconnected components that are a good match for an orchestrated containerized platform like Kubernetes. But with the increased flexibility afforded by these technologies comes a new set of challenges for building rich data-centric applications.



In this presentation we will discuss techniques for building multi-component Apache Spark based applications that can be easily deployed and managed on a Kubernetes infrastructure. Building on experiences learned while developing and deploying cloud native applications on an OpenShift platform, we will explore common issues that arise during the engineering process and demonstrate workflows for easing the maintenance factors associated with complex installations.

Speakers
avatar for Michael McCune

Michael McCune

Senior Software Engineer, Red Hat
Michael is a software developer in Red Hat's emerging technology group. He is a contributor to, and core reviewer for the Oshinko project, the Sahara project, the OpenStack Security Project, and the OpenStack API Working Group. For the last two years he has been creating and deploying data-driven applications and infrastructure for the OpenStack and OpenShift platforms.


Monday November 14, 2016 13:00 - 13:50
Giralda I/II

13:50

15:30

Fighting Identity Theft: Big Data Analytics to the Rescue - Seshika Fernando, WSO2
Identity theft is no longer just a consumer's problem. Attackers are now targeting enterprises for bigger financial gains and greater damage, not just to the organization's infrastructure but, more importantly, to its corporate image.



While enterprise identity theft analytics tools do exist, most organizations find it economically prohibitive to invest in expensive proprietary software. In this session, Seshika will show how a comprehensive identity theft analytics solution can be built using open source technologies. She will demonstrate how big data analytics can be used to safeguard any enterprise by covering the 4 A's of Identity Analytics:

• Authentication Analytics

• Authorization Analytics

• Audit Trail Analytics

• Adaptive Analytics

Speakers
SF

Seshika Fernando

Seshika is a Senior Technical Lead at WSO2 and focuses on the applications of WSO2's middleware platform in financial markets. Throughout her career, she has had extensive experience in providing technology for stock exchanges, regulators and investment banks from across the globe. Her current area of interest is real-time anomaly detection and its usage in e-commerce. She holds a BSc (Hons) in Computer Science from the... Read More →


Monday November 14, 2016 15:30 - 16:20
Giralda III/IV

15:30

Performance Monitoring for the Cloud - Werner Keil, Agile Coach
Performance monitoring tools like Performance Co-Pilot (PCP) have existed almost as long as the World Wide Web. PCP was developed in the early 90s by SGI. Parts were made available as open source from 2000 on, which led to a further spread of the tool. In recent years an active community formed and a variety of new features and enhancements were added. PCP is now part of the Red Hat and SuSE Linux Enterprise editions and included in many other Linux distributions. Versions for other Unix variants, OS X and Windows also exist. This session compares popular open source monitoring tools like Performance Co-Pilot, StatsD, Dropwizard Metrics, Prometheus and Apache Sirona: how each supports containers and virtualization, shares data with IT monitoring systems like Nagios or Zabbix, and processes, analyzes and visualizes it via Carbon, Graphite or Grafana/Elasticsearch.
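Of the tools compared above, StatsD has the simplest integration surface: a metric is a plain-text "name:value|type" datagram sent over UDP, so a counter can be emitted with the standard library alone. The host and port below are the conventional StatsD defaults but should be treated as assumptions:

```python
import socket

def statsd_counter(name, value=1, host="127.0.0.1", port=8125):
    """Emit a StatsD counter increment; returns the wire payload."""
    payload = f"{name}:{value}|c"          # "|c" marks a counter metric
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode("ascii"), (host, port))  # fire-and-forget
    sock.close()
    return payload

print(statsd_counter("app.logins"))   # app.logins:1|c
```

Because UDP is fire-and-forget, instrumented code pays almost nothing when no collector is listening, which is a large part of StatsD's appeal.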

Speakers
avatar for Werner Keil

Werner Keil

Director, Creative Arts & Technologies
Werner Keil is an Agile Coach and Java and IoT/Embedded expert, helping Global 500 enterprises across industries and leading IT vendors. He has worked for over 25 years as Program Manager, Coach, SW architect and consultant for the Finance, Mobile, Media, Transport and Public sectors. Werner is an Eclipse and Apache committer and a JCP member in JSRs like 354 (Money), 358/364 (JCP.next), Java ME 8, 362 (Portlet 3), 363 (Units, also Spec Lead), 365 (CDI 2), 366 (Java... Read More →


Monday November 14, 2016 15:30 - 16:20
Santa Cruz

15:30

Processing Planetary Sized Datasets - Tim Park, Microsoft
In my group at Microsoft, we have worked with the United Nations, Guide Dogs for the Blind in the UK, several automotive companies, and Ströer on a number of projects involving high scale geospatial data.



In this talk, I'll share some of the best practices and patterns that have come out of those experiences: best practices for storing and indexing geospatial data at scale, incremental ingestion and slice processing of the data, and efficiently building and presenting progressive levels of detail.



The audience will walk away with an understanding of how to efficiently summarize data over a geographic area, general methods for doing ingestion with Apache Kafka (or other event ingestion systems), incremental updates to large scale datasets with Apache Spark, and best practices around visualizing this data on the frontend.
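A common building block for summarizing data over a geographic area is bucketing points into Web Mercator tiles and aggregating per tile. The sketch below uses the standard slippy-map tile formula; it illustrates the general technique, not the speaker's exact pipeline:

```python
import math

def latlon_to_tile(lat, lon, zoom):
    """Map a WGS84 point to a Web Mercator tile (x, y) at a zoom level."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.log(math.tan(lat_rad) + 1 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Aggregate point counts per tile: a simple progressive level of detail,
# since coarser zooms merge neighboring tiles into one bucket.
points = [(48.8566, 2.3522), (48.8570, 2.3530), (40.4168, -3.7038)]
counts = {}
for lat, lon in points:
    key = latlon_to_tile(lat, lon, 12)
    counts[key] = counts.get(key, 0) + 1
print(len(counts))   # 2: the two nearby points share one tile
```

Rendering lower zoom levels then becomes a matter of re-aggregating the per-tile counts rather than rescanning the raw points.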

Speakers
TP

Tim Park

Principal Software Engineer
Tim is a Principal Software Engineer at Microsoft and works with customers and partners to help them utilize open source platforms on Microsoft's Azure cloud. He has a particular focus on big data and, in particular, processing large scale geospatial data. His project experience in this space includes several global car manufacturers, outdoor display advertiser Stroeer, several Internet of Things startups, and detecting humanitarian crisis... Read More →


Monday November 14, 2016 15:30 - 16:20
Carmona

15:30

Moven: Machine/Deep Learning Models Distribution Relying on the Maven Infrastructure - Sergio Fernandez, Redlink GmbH
Modern NLP pipelines use large models that need to be distributed across the whole processing infrastructure. For example, in the SSIX project we're managing models of several GBs for the financial sector. At that scale you can't assume the models will be transferred at task submission time, nor manually. From our research, there doesn't seem to be any well-accepted approach to solving this issue (e.g., TensorFlow simply uses git).

Moven (models+maven) is a proof of concept that relies on the Maven infrastructure to publish machine/deep learning models. The current implementation allows them to be used in both Java and Python. We are also targeting the more specific needs of some concrete environments, such as Apache Spark or the Apache Beam Runners API.

Further details at https://bitbucket.org/ssix-project/moven

Speakers
avatar for Sergio Fernández

Sergio Fernández

Software Engineer, Redlink GmbH
I'm a Software engineer specialized in innovation, with a focus on Data Architectures. My interests include Distributed Architectures, Data Integration, Linked Data and System Engineering. I've worked as software engineer and project manager in different industries, but always somehow close to science; because I strongly believe that innovation can be achieved by equally using research and engineering. Therefore all my scientific contributions... Read More →



Monday November 14, 2016 15:30 - 16:20
Nervion/Arenal II/III

15:30

Large Scale SolrCloud Cluster Management via APIs - Anshum Gupta, IBM Watson
Apache Solr is widely used by organizations to power their search platforms and often supports multiple users. A lot of cluster management APIs were introduced over the last few releases, allowing users to manage operations ranging from replica placement to forcing leader elections via API calls. At the end of this talk, intermediate Solr users will understand what's available and when they can avoid direct interference with the system, leading to more stable clusters and lower chances of nodes going down. Attendees will also be much better equipped to build their own SolrCloud cluster management tools. I will also talk about when not to use these APIs and what's planned in the near future to handle specific operational use cases.
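The cluster management calls discussed here are plain HTTP requests against Solr's Collections API. A small sketch of composing one such request (ADDREPLICA is a real Collections API action; the host, collection and shard names below are made up for illustration):

```python
from urllib.parse import urlencode

def collections_api_url(base, action, **params):
    """Build a Solr Collections API request URL."""
    query = urlencode(dict(action=action, **params))
    return f"{base}/admin/collections?{query}"

url = collections_api_url(
    "http://solr.example.com:8983/solr",   # hypothetical Solr node
    "ADDREPLICA",
    collection="emails",
    shard="shard1",
)
print(url)
# http://solr.example.com:8983/solr/admin/collections?action=ADDREPLICA&collection=emails&shard=shard1
```

A management tool is then mostly a matter of issuing such requests and polling the returned async request status, which is why these APIs make home-grown tooling practical.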

Speakers
avatar for Anshum Gupta

Anshum Gupta

Sr. Software Engineer, IBM Watson
Anshum Gupta is a Lucene/Solr committer and PMC member with over 10 years of experience with search. He is a part of the search team at IBM Watson, where he works on extending the limits and improving SolrCloud. Prior to this, he was a part of the open source team at Lucidworks and also the co-creator of AWS CloudSearch - the first search as a service offering by AWS. He has spoken at multiple international conferences, including Apache Big Data... Read More →


Monday November 14, 2016 15:30 - 16:20
Giralda V

15:30

Open Source Operations: Building on Apache Spark with InsightEdge, TensorFlow, Apache Zeppelin, and/or Apache - Samuel Cozannet, Canonical
As software becomes more free and open, it is also becoming more complex and expensive to operate. How can we as an open source community distill best practices and recommended operations to model complex interconnected services, so users can focus on their ideas? How can we as developers deliver recommended best practices in our applications, and when connected to other applications, so users are free to contribute and use the project on their choice of substrate (laptop, cloud, or bare metal [x86, ARM, ppc64el, s390x])?

In this talk we explore how Juju can provide an Open Source method to model a multi-node Apache Spark cluster across a diverse set of substrates, and start adding other services to build additional solutions. This talk will include a demo, and users should be able to take all software shown to try for themselves in a free and Open Source manner.

Speakers

Monday November 14, 2016 15:30 - 16:20
Giralda VI/VII

15:30

Scalable Data Science in R and Apache Spark 2.0 - Felix Cheung, Committer
R is a very popular platform for data science. Apache Spark is a highly scalable data platform. How could we have the best of both worlds? In this talk we will walk through many examples of how several new features in Apache Spark 2.0.0 enable this. We will also look at exciting changes coming next in Apache Spark 2.0.1 and 2.1.0.




Speakers
avatar for Felix Cheung

Felix Cheung

Principal Engineer, Microsoft
Felix Cheung is a Committer of Apache Spark and a PMC/Committer of Apache Zeppelin. He has been active in the Big Data space for 3+ years. He is a co-organizer of the Seattle Spark Meetup, has presented several times, and was a teaching assistant for the very popular edX Introduction to Big Data with Apache Spark and Scalable Machine Learning MOOCs in the summer of 2015. He also gave a tutorial session at ApacheCon: Big Data North America 2016.



Monday November 14, 2016 15:30 - 16:20
Giralda I/II

15:30

Streaming Report: Functional Comparison and Performance Evaluation - Huafeng Wang, Intel
Streaming Report (Mao Wei, Intel) - Stream processing technology has developed quickly in recent years. Spark Streaming, Flink, Storm, Heron, Gearpump: so many choices are available when people want to pick the right one to solve their real business problems. In this presentation, Mao Wei will go through all of these frameworks and compare them in detail. From a functional aspect, Wei will discuss the underlying mechanisms of these frameworks and review several function points that users generally care about. From a practical aspect, you will see performance test results based on HiBench, a cross-platform micro-benchmark suite for big data open sourced by Intel BDT. The test cases include identity, repartition, state operations and window operations.

Speakers
avatar for Huafeng Wang

Huafeng Wang

Software engineer, Intel
Huafeng is a software engineer from Intel's Big Data engineering group, as well as a committer of Apache Gearpump, which is an open sourced streaming process engine initiated by Intel.



Monday November 14, 2016 15:30 - 16:20
Nervion/Arenal I

16:30

How Big Data/IoT Leverage the Power of OpenSource to Solve Healthcare Use Cases - Manidipa Mitra, ValueLabs
This session will talk about how a digital healthcare management platform can be built (using open source technologies such as Kafka, Spark Streaming, HBase, Hive, PySpark and Mirth) to collect patient data, clinical data (HL7 data), claims data and real-time wearables data, and to create a 360-degree view of and insights into a patient's health risks and conditions. It will also cover how to build a generic platform (scraping blogs, message boards and articles using Scrapy; ingesting Facebook and Twitter data; storing, analysing and indexing it; building social sentiment; creating word clouds; and segmenting messages using Spark, HBase, Hive, Python and Solr) to find the key opinion leaders for a particular disease discussion in social media, and how to provide insights, social sentiment and search capabilities on the medicines used for a particular disease or treatment, to gather feedback on medicines or for research purposes.

Speakers
avatar for Manidipa Mitra

Manidipa Mitra

Director, ValueLabs
Manidipa Mitra heads the Big Data CoE in ValueLabs having extensive experience in building industry specific solution using distributed computing and cloud technologies . Having 16+ years of software industry experience and in-depth knowledge on disruptive-technologies, Cloud and Storage . She is holding dual graduate degree in Physics and Computer science. Manidipa previously Invited as Speaker in Grace Hopper Conference 2013, and presently... Read More →


Monday November 14, 2016 16:30 - 17:20
Giralda III/IV

16:30

Interactive Analytics at Scale in Apache Hive Using Druid - Jesús Camacho Rodríguez, Hortonworks
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications. However, it does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid's indexing and querying capabilities using Apache Hive. In particular, our solution allows users to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries.

Speakers
avatar for Jesús Camacho Rodríguez

Jesús Camacho Rodríguez

Member of Technical Staff, Hortonworks
Jesús Camacho Rodríguez is a Member of Technical Staff at Hortonworks, the PMC chair of Apache Calcite, and a PMC member of Apache Hive. His current work focuses on extending and improving query processing and optimization, ensuring that the increasingly complex workloads supported by Hive are executed quickly and efficiently. Prior to that, Jesús obtained his PhD in Computer Science from Paris-Sud University and Inria, working on... Read More →



Monday November 14, 2016 16:30 - 17:20
Nervion/Arenal II/III

16:30

Integrators at Work! Real-Life Applications of Apache Big Data Components - Moderated by Phil Archer, W3C
The event will offer both insights and a hands-on opportunity to learn about and try out the Big Data Platform devised by the BigDataEurope (BDE) project. Its most striking feature is the ease with which many Apache big data components like Apache Spark, Flink, Kafka, Cassandra and Solr can be instantiated through a simple UI, thanks to the project's use of Docker containers and Docker Swarm.

BDE is producing 7 pilot instances aligned with the 7 H2020 Societal Challenges (SC), each of which targets a real-world use-case. Things like handling the 2GB of data produced each day by a single typical wind turbine; data mining academic journals and matching the named entities with further information about them including images; tracking changes in land use and matching them with social and professional media feeds. Many of these use cases depend on another key feature of the BigDataEurope platform - the semantification of big data.

Participants will also have the opportunity to shape the next stage of the BDE platform, based on their unique skills and experiences with the Apache technology.

Moderators
avatar for Phil Archer

Phil Archer

Data Strategist, W3C
Phil Archer is the Data Strategist at W3C, the industry standards body for the World Wide Web, coordinating W3C's work in the Semantic Web and related technologies. He is most closely involved in the Data on the Web Best Practices, Permissions and Obligations Expression and Spatial Data on the Web Working Groups. His key themes are interoperability through common terminology and URI persistence. As well as work at the W3C, his career has... Read More →

Speakers
AC

Angelos Charalambidis

Researcher, NCSR “Demokritos”
Angelos Charalambidis is a postdoctoral researcher in the Data Engineering Group of the Institute of Informatics at NCSR “Demokritos”. He received his PhD in Programming Languages and his main interests include declarative programming languages, big data systems optimisations and topics of the Semantic Web.
avatar for Hajira Jabeen

Hajira Jabeen

Senior Researcher, University of Bonn
She is a work package lead and coordinator for Big Data Europe. Her research interests are Big Data, Structured Machine Learning, Semantic Web, Data Mining and Evolutionary Computation.
avatar for Axel-Cyrille Ngonga Ngomo

Axel-Cyrille Ngonga Ngomo

Head of Research Group, INFAI
Head of AKSW (http://aksw.org) at University of Leipzig/InfAI, a research group with ca. 50 members. Author of 120+ research papers and 20+ presentations are top-tier conferences. Received manifold research awards including Next Einstein Forum award 2016, 12 best research paper awards and competition wins. Coached by Lisa Shufro (ex-TED coach) in speaking. Currently head of the HOBBIT project (http://project-hobbit.eu), which focuses on... Read More →
SS

Simon Scerri

BDE Deputy Coordinator, Fraunhofer IAIS
Simon Scerri is a senior postdoc in the “Enterprise Information Systems” department at Fraunhofer IAIS and at the University of Bonn. In 2011, Simon received his Ph.D. from the Faculty of Engineering at the National University of Ireland, Galway. Prior to joining Fraunhofer, Simon contributed to research efforts (2005–2013) at the Digital Enterprise Research Institute (DERI). In 2014–2015 Simon was an ERCIM fellow... Read More →


Monday November 14, 2016 16:30 - 17:20
Giralda I/II

16:30

SystemML - Declarative Machine Learning - Luciano Resende, IBM
Machine learning in the enterprise is an iterative process. Data scientists will tweak or replace their learning algorithm on a small data sample until they find an approach that works for the business problem, and then apply the analytics to the full data set. Apache SystemML is a new system that accelerates this kind of exploratory algorithm development for large-scale machine learning problems. Think of SystemML as SQL for machine learning: it provides a high-level language to quickly implement and run algorithms, and it also provides a cost-based optimizer that takes care of low-level decisions about parallelism, allowing users to focus on the algorithm and the real-world problem it is trying to solve. This talk will introduce you to SystemML and get you started building declarative analytics with SystemML using Zeppelin notebooks.

Speakers
avatar for Luciano Resende

Luciano Resende

Architect, Spark Technology Center, IBM
Luciano Resende is an Architect in IBM Analytics. He has been contributing to open source at The ASF for over 10 years, he is a member of ASF and is currently contributing to various big data related Apache projects including Spark, Zeppelin, Bahir. Luciano is the project chair for Apache Bahir, and also spend time mentoring newly created Apache Incubator projects. At IBM, he contributed to several IBM big data offerings, including BigInsights... Read More →


Monday November 14, 2016 16:30 - 17:20
Santa Cruz

16:30

ETL Pipelines with OODT, Solr and Stuff - Tom Barber, Meteorite Consulting
Discover a number of Apache projects you may not have heard of and how they can help you process both clinical and non-clinical data. Apache OODT, developed by NASA, allows users to ingest and store files and metadata along with process workflows. OODT, along with Apache cTAKES, allows us to extract clinical information from files, process it, and give end users access to the extracted data.



We can then take these sources and manipulate them further, creating a highly flexible ETL pipeline offering reliability and scalability. Backed by Apache Solr, users can then interrogate the data via web interfaces and instigate further post-processing and investigation.



Of course you may not have a clinical use case, but the platforms can be repurposed and will allow you to go away and build your own scalable data pipeline for processing and ingestion.
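The ingest, extract and index stages described above can be sketched as a toy pipeline of plain functions (illustrative only; Apache OODT's real catalog and workflow components are far richer):

```python
import hashlib

def ingest(name, payload):
    """Accept a file into the pipeline as a simple record."""
    return {"name": name, "payload": payload}

def extract_metadata(record):
    """Derive metadata from the payload, as an extractor stage would."""
    record["metadata"] = {
        "size": len(record["payload"]),
        "checksum": hashlib.sha256(record["payload"]).hexdigest(),
    }
    return record

def index(catalog, record):
    """Register the extracted metadata so it can be queried later."""
    catalog[record["name"]] = record["metadata"]
    return catalog

catalog = {}
for name, payload in [("report.txt", b"clinical notes"), ("scan.dat", b"\x00" * 64)]:
    index(catalog, extract_metadata(ingest(name, payload)))

print(sorted(catalog))               # ['report.txt', 'scan.dat']
print(catalog["scan.dat"]["size"])   # 64
```

Each stage being a pure function over a record is what makes such a pipeline easy to reorder, extend with new extractors, or scale out.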

Speakers
avatar for Tom Barber

Tom Barber

Technical Director, Meteorite Consulting
Tom Barber is the director of Meteorite BI and Spicule BI. A member of the Apache Software Foundation and regular speaker at ApacheCon, Tom has a passion for simplifying technology. The creator of Saiku Analytics and open source stalwart, when not working for NASA, Tom currently deals with Devops and data processing systems for customers and clients, both in the UK, Europe and also North America.


Monday November 14, 2016 16:30 - 17:20
Giralda V

16:30

Deep Neural Network Regression at Scale in Spark MLlib - Jeremy Nixon, Spark Technology Center
Jeremy Nixon will focus on the engineering and applications of a new algorithm in MLlib. The presentation will cover the methods the algorithm uses to automatically generate features that capture nonlinear structure in data, as well as the process by which it is trained. Major aspects of that are the compositional transformations over the data, the advantages of the various activation functions, the final linear layer, the cost function, and training via backpropagation. Applications will look into how to use neural network regression to model data in computer vision, finance, and the environment. Details around optimal preprocessing, the type of structure that can be found, and managing the model's ability to generalize will inform developers looking to apply nonlinear modeling tools to the problems they face.
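The ingredients listed above (compositional transformations, a nonlinear activation, a final linear layer, and training via backpropagation) fit in a short from-scratch sketch on a one-dimensional toy problem. This is illustrative only and not MLlib's implementation:

```python
import math
import random

random.seed(0)
H = 8                                            # hidden units
w1 = [random.uniform(-1, 1) for _ in range(H)]   # input -> hidden weights
b1 = [0.0] * H
w2 = [random.uniform(-1, 1) for _ in range(H)]   # hidden -> output weights
b2 = 0.0
lr = 0.01                                        # learning rate

def forward(x):
    h = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]  # nonlinear features
    y = sum(w2[j] * h[j] for j in range(H)) + b2          # final linear layer
    return h, y

# toy regression target: y = sin(x) on [-3, 3]
data = [(x / 10.0, math.sin(x / 10.0)) for x in range(-30, 31)]

def mse():
    return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

loss_before = mse()
for _ in range(300):                     # plain stochastic gradient descent
    for x, t in data:
        h, y = forward(x)
        dy = 2 * (y - t)                 # dL/dy for squared error
        for j in range(H):
            dh = dy * w2[j] * (1 - h[j] ** 2)   # tanh'(z) = 1 - tanh(z)^2
            w2[j] -= lr * dy * h[j]      # backpropagate through each layer
            w1[j] -= lr * dh * x
            b1[j] -= lr * dh
        b2 -= lr * dy
loss_after = mse()
print(loss_after < loss_before)          # training reduced the error
```

The hidden tanh layer is what "automatically generates features": each unit learns a nonlinear basis function, and the final linear layer combines them into the regression output.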

Speakers
avatar for Jeremy Nixon

Jeremy Nixon

Machine Learning Engineer, Spark Technology Center
I'm a Machine Learning Engineer at the Spark Technology Center, focused on scalable deep learning. I contribute to MLlib at the STC, which I joined after graduating from Harvard College, concentrating in Applied Mathematics and Computer Science.


Monday November 14, 2016 16:30 - 17:20
Giralda VI/VII

16:30

Myriad, Spark, Cassandra, and Friends - Big Data Powered by Mesos - Jörg Schad, Mesosphere
Processing Big Data necessitates large compute clusters, and large clusters, especially when running multiple Big Data systems, require some kind of cluster manager and cluster scheduler.

In this talk, we will give an overview how Apache Mesos and DC/OS help solve the problems of large scale clusters and then take a look at the current state of the Big Data ecosystem built on top of this foundation.

We will discuss the differences between Apache YARN and Apache Mesos and why, thanks to Apache Myriad, they are not mutually exclusive choices.

Furthermore, we will look at the growing Big Data ecosystem on top of Apache Mesos and DC/OS including, for example, Apache Spark, Apache Cassandra, and Apache Kafka.

Finally, we will also provide some insights into future developments, both for the foundation (i.e., Apache Mesos and DC/OS) as well as the Big Data ecosystem on top.

Speakers
avatar for Jörg Schad

Jörg Schad

Mesosphere
Jörg is a software engineer at Mesosphere in Hamburg. In his previous life, he implemented distributed and in memory databases and conducted research in the Hadoop and Cloud area. His speaking experience includes various meetups, international conferences, and lecture halls.


Monday November 14, 2016 16:30 - 17:20
Carmona

16:30

Real Time Aggregates in Apache Calcite -- Optimal Use of your Streaming Data - Atri Sharma, Microsoft
The talk will focus on how to develop applications in the real-time analytics space using Apache Calcite's advanced query planning capabilities. It will give a short overview of Calcite's planner and rules engine, then discuss the capabilities that can be used to develop real-time applications that continuously stream and process data. The talk will cover the ongoing work in Calcite's framework and the streaming aggregation features that will be present soon. It will also look at Calcite's highly adaptable framework, which allows Calcite to work with many existing projects, and how your current application can take advantage of Calcite's planning and aggregation capabilities.
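The streaming aggregation the talk covers can be illustrated independently of Calcite's planner with the simplest windowed aggregate, a tumbling window over event time (a hedged, stdlib-only sketch):

```python
from collections import defaultdict

def tumbling_window_sum(events, width):
    """Sum values per tumbling window.

    events: iterable of (event_time, value) pairs
    width:  window size in the same time unit as event_time
    """
    windows = defaultdict(int)
    for ts, value in events:
        windows[(ts // width) * width] += value   # bucket by window start
    return dict(sorted(windows.items()))

stream = [(0, 1), (3, 2), (5, 4), (9, 8), (12, 16)]
print(tumbling_window_sum(stream, 5))   # {0: 3, 5: 12, 10: 16}
```

A streaming SQL planner's job is essentially to recognize a GROUP BY over such a window expression and emit each window's aggregate as soon as the window can no longer receive data.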

Speakers
avatar for Atri Sharma

Atri Sharma

Software Engineer, Azure Data Lake, Microsoft
An Apache Apex committer where he is engaged in designing and implementing next generation features and performing reviews.A learning PostgreSQL hacker who is currently engaged in various aspects of Postgres.He has been an active contributor,implementing ordered set functions, implementing grouping sets in Postgresql, improving sort and hashjoin performance and OLAP performance. He is also a committer for Apache HAWQ, Apache MADLib and has been... Read More →


Monday November 14, 2016 16:30 - 17:20
Nervion/Arenal I

17:30

BoF Space Available - Book Now! (Space is Limited)
Are you passionate about a topic and want to share that with others? If so, sign up to lead a Birds of a Feather (BoF) session. Instead of passive listening, all attendees and organizers are encouraged to become participants, with discussion leaders providing moderation and structure for attendees. To sign up for a BoF Session, please book through the form. You will select the time and then be prompted to enter your BoF details.

Monday November 14, 2016 17:30 - 18:30
TBA

17:30

BoF: Apache Beam and You! - Jean-Baptiste Onofré, Talend & ASF and Dan Halperin, Google
Apache Beam is a unified programming model for big data processing. In particular, Beam is built around three visions and kinds of users: while the end users are the pipeline writers, the SDK & DSL writers and the runner writers can be you, as a contributor to other Apache projects!

Speakers
DH

Dan Halperin

Google
Dan Halperin is a PPMC member and committer on Apache Beam (incubating). He has worked on Beam and Google Cloud Dataflow for 18 months. Prior to that, he was the Director of Research for Scalable Data Analytics at the University of Washington eScience Institute, where he worked on scientific big data problems in oceanography, astronomy, medical informatics, and the life sciences. Dan received his Ph.D. in Computer Science and Engineering at the... Read More →
JO

Jean-Baptiste Onofré

Apache Software Foundation
JB is Apache Beam's champion and a member of the Beam PPMC. He is a long-tenured Apache Member, serving as PMC member/committer for 20 projects that range from integration to big data.


Monday November 14, 2016 17:30 - 18:30
Nervion/Arenal II/III

17:30

BoF: Apache Branding Policies & Trademarks - Shane Curcuru, Apache Software Foundation
Come meet Shane, VP, Brand for the ASF, to ask your questions about how Apache projects are branded and how you can respect Apache trademarks!
https://s.apache.org/trademarks

Speakers

Shane Curcuru

VP, Brand Management, The Apache Software Foundation
Shane serves as V.P. of Brand Management for the ASF, setting trademark and brand policy for all 250+ Apache projects, and has served as five-time Director, and member and mentor for Conferences and the Incubator. Shane's Punderthings consultancy is here to help both companies and FOSS communities understand how to work together better. At home, Shane is: a father and husband, a Member of the ASF, a BMW driver and punny guy. Oh, and we... Read More →


Monday November 14, 2016 17:30 - 18:30
Nervion/Arenal I

17:30

BoF: Apache Way as a Cultural Template - Dzmitry Pletnikau, Unicity Intl
Organizations operate within a framework of rules, written or implicit. These rules form the "culture". A manager's or executive's job is to shape and guide the evolution of these rules. Can the Apache Way be used as a drop-in cultural template in any organization producing intellectual assets?

Speakers

Dzmitry Pletnikau

Unicity Intl
Growing up in Belarus, Dzmitry developed an early interest in natural sciences and computer programming. After receiving a degree in Physics, Dzmitry chose to freelance as a programmer. Five years later he settled as a Software Architect at Unicity, in Utah. Dzmitry is responsible for moving the enterprise from 20-year-old systems into the 21st century with no downtime and no interruptions. He is breaking the Cobol-based monolith... Read More →



Monday November 14, 2016 17:30 - 18:30
Santa Cruz

17:30

BoF: Open Source Beyond Software - Alexander Bezzubov, NF Labs
Open source software in general, and the Apache Software Foundation in particular, are a great example of how the principles below have changed the whole industry:
  • Permissive licensing
  • Open governance 
  • Distributed networks of collaborators
  • Work, guided by one's desire 
The same principles are beginning to be applied to other aspects of life by different communities around the globe:
  • Open hardware 
  • Makers 
  • Publishing
  • DIYbio
  • Housing 

As well as some more traditional cultural phenomena similar in spirit:
  • Shanzhai (Chinese: 山寨) 
  • Kibbutz (Hebrew: קִבּוּץ / קיבוץ) 

Let's explore the existing initiatives and see where they can lead us together!

Speakers

Alexander Bezzubov

Software Engineer, NFLabs
Alexander Bezzubov is an Apache Zeppelin contributor, PMC member and software engineer at NFLabs. Previous speaking experience includes Apache BigData NA 2016 in Vancouver, FOSSASIA 2016 in Singapore, and Apache BigData EU 2015 in Budapest.


Monday November 14, 2016 17:30 - 18:30
Carmona
 
Tuesday, November 15
 

07:00

Morning Run
Come meet in the lobby of the Melia Sevilla at 7:00am for a morning run. The plan is to cross to the park and run next to the river. 

This will last an hour and the group will be back by 8:00am.

Tuesday November 15, 2016 07:00 - 08:00
Melia Sevilla Hotel Lobby

08:30

Breakfast
Tuesday November 15, 2016 08:30 - 09:30
Giralda Foyer

08:30

Sponsor Showcase
Tuesday November 15, 2016 08:30 - 15:30
Triana Foyer

08:30

Registration
Tuesday November 15, 2016 08:30 - 18:00
Triana Foyer

09:30

Keynote: Hadoop Infrastructure @Uber Past, Present and Future - Mayank Bansal, Sr. Engineer, Uber
Uber’s mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. At Uber, Hadoop plays a critical role in the data infrastructure. We will talk about the journey of Hadoop at Uber and our future plans for scaling to billions of trips. We will talk about the most unique use cases Uber has and how Hadoop and the ecosystem we built helped us on this journey. We will talk about how we scaled from 10 to 1,000 nodes, and how in the future, with the help of Apache Mesos, Myriad and Hadoop, we will scale to tens of thousands of nodes. We will talk about our mistakes, learnings and wins, and how we process billions of events per day. We will talk about the unique challenges and real-world use cases, and how we will co-locate Uber’s service architecture with batch workloads (e.g. data pipelines, machine learning and analytical workloads).

Speakers

Mayank Bansal

Software Engineer, Uber
Mayank Bansal is currently working as a Sr. Engineer on the data infrastructure team at Uber. He is an Apache Hadoop committer and an Apache Oozie PMC member and committer. Previously he worked on the Hadoop platform team at eBay, leading the YARN and MapReduce efforts. Prior to that he worked at Yahoo and contributed extensively to Oozie.



Tuesday November 15, 2016 09:30 - 09:50
Giralda I/II

09:55

Keynote: The ASF's Big Tent - Sean Owen, Director of Data Science, Cloudera
The ASF is going stronger than ever: more projects, contributors and corporations under an increasingly big tent. While the ASF facilitates software development on its surface, the ASF is more than just a GitHub. Its collaboration model and people drive the success and longevity of its projects and of the foundation. Together in person we should strengthen community bonds.

Speakers

Sean Owen

Director of Data Science, Cloudera
Sean is Director of Data Science, based in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time learning on Hadoop. He is on the Apache Spark PMC and co-author of Advanced Analytics with Spark. Previously, Sean was a senior engineer at Google. He holds an MBA from London Business School and a BA from Harvard University.


Tuesday November 15, 2016 09:55 - 10:10
Giralda I/II

10:10

Coffee Break
Tuesday November 15, 2016 10:10 - 11:00
Giralda Foyer

11:00

Real Use Cases of Kappa Architecture - Juantomas Garcia, Open Sistemas
During the last three years we have used Kappa Architecture in almost all our projects. We want to show how Kappa Architecture fits projects of very different sizes. Kappa Architecture is not the silver bullet for every project, but it is very likely ...
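The core idea behind Kappa Architecture can be sketched in a few lines (a toy illustration with invented types, not Open Sistemas' actual stack): a single append-only log is the source of truth, and every view is derived by replaying that log, so batch and streaming share one code path.

```python
# Toy illustration of the Kappa Architecture idea: one append-only log,
# with views derived purely by replaying it (no separate batch layer).

class Log:
    """Append-only event log, standing in for e.g. a Kafka topic."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def replay(self):
        yield from self._events

def page_view_counts(log):
    """A 'view' rebuilt from scratch by replaying the log."""
    counts = {}
    for event in log.replay():
        if event["type"] == "page_view":
            counts[event["page"]] = counts.get(event["page"], 0) + 1
    return counts

log = Log()
log.append({"type": "page_view", "page": "/home"})
log.append({"type": "click", "page": "/home"})
log.append({"type": "page_view", "page": "/home"})

print(page_view_counts(log))  # {'/home': 2}
```

Because views are pure functions of the log, fixing a bug in a view means redeploying the function and replaying, rather than maintaining parallel batch and speed layers.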

Speakers

Juantomas Garcia

Data Solutions Manager, Open Sistemas
President of Hispalinux (the Spanish Linux user group) from 1999 to 2007. Author of "La Pastilla Roja" (2004), the first book in Spanish about free software. More than 200 lectures around the world. Now CDO of Open Sistemas and an advocate of Apache Spark and Kappa Architecture. Organizer of the Machine Learning Spain Meetup.


Tuesday November 15, 2016 11:00 - 11:50
Nervion/Arenal I

11:00

Building a Robust Analytics Platform with an Open-Source Stack - Dao Mi & Alex Kass, DigitalOcean
As modern enterprises migrate to microservice-centric cloud architectures, it has become imperative to build a new data analysis framework to handle, often in real time, the event-based data these services produce. For this presentation, we will demonstrate how to leverage multiple open source projects to quickly and cheaply build a robust framework that can scale with an organization as it grows and inexorably generates more and more data.

The speakers will cover a tangible, real-world implementation that includes Apache technologies such as Kafka, Mesos, and Spark, as well as the open-source PrestoDB (from Facebook).



The speakers will discuss lessons learnt during and after the build, as well as some specific use cases for how this approach brought about otherwise-unattainable, actionable business insights and results, including hardware failure prediction and capacity planning.

Speakers

Alex Kass

Data Science Engineer, DigitalOcean
Alex Kass has worked at companies ranging from large financial institutions to early-stage startups, regularly building successful analytical models and systems of varying size. Now at DigitalOcean, he has at his disposal sufficient software and hardware firepower (& startup autonomy) to experiment and build with both stable and cutting edge technologies, delivering actionable statistical insights at scale.

Dao Mi

Data Engineer, Digital Ocean
Dao Mi has extensive experience working with data of different scale, type and velocity across myriad industries, from natural gas to floating bonds. While at Microsoft, he helped deliver BI and predictive analytics solutions to Fortune 500 clients. He has helped build a large custom analytics platform for use across DigitalOcean, a fast-growing global cloud hosting company, where he works presently. He has spoken previously at TechReady for... Read More →


Tuesday November 15, 2016 11:00 - 11:50
Giralda V

11:00

Crawling the Web for Common Crawl - Sebastian Nagel, Common Crawl
Common Crawl is a non-profit organization that regularly crawls a significant sample of the web and makes the data accessible free of charge to everyone interested in running machine-scale analysis on web data. The presentation will demonstrate how to use the Common Crawl data, covering data formats and tools as well as examples and derived datasets. The monthly crawls are run with Apache Nutch on Apache Hadoop. Sebastian will also share his experience from running a web-scale crawl on a small budget.

Speakers

Sebastian Nagel

Crawl Engineer, commoncrawl.org
Sebastian Nagel works as a crawl engineer at Common Crawl, a non-profit organization that makes web data freely accessible to everyone. Prior to joining Common Crawl he implemented search and data quality solutions at Exorbyte. Sebastian is a committer and PMC member of Apache Nutch, a scalable web crawler, and presented the project at ApacheCon 2014.



Tuesday November 15, 2016 11:00 - 11:50
Giralda III/IV

11:00

Apache HBase: Overview and Use Cases - Apekshit Sharma, Cloudera
NoSQL databases are critical in building Big Data applications. Apache HBase, one of the most popular NoSQL databases, is used by Facebook, Apple, eBay and hundreds of other enterprises to store, analyze and profit from their petabyte-scale volumes of data. This talk will discuss:

- the motivation behind NoSQL databases

- the basic architecture of a popular NoSQL system, Apache HBase

- some commonly seen big data usage patterns in industry, and when & how to use Apache HBase (or another better-suited NoSQL database).

Speakers

Apekshit Sharma

Software Engineer, Cloudera Inc
Apekshit Sharma (Appy) is a Software Engineer at Cloudera and a contributor to Apache HBase. Previously, he was at Google building backend infrastructure using MapReduce, Bigtable & Millwheel. He earned his B.Tech in Computer Science from the Indian Institute of Technology, Bombay. Currently he is working on the performance framework and dynamic configuration framework in Apache HBase. He has also given tutorials at the NoSQL Now! and JavaOne conferences.


Tuesday November 15, 2016 11:00 - 11:50
Carmona

11:00

Native and Distributed Machine Learning with Apache Mahout - Suneel Marthi, Red Hat
Data scientists love tools like R and Scikit-Learn since they are declarative and offer convenient, intuitive syntax for analysis tasks, but these tools are limited by local memory. Mahout offers similar features with near-seamless distributed execution.

In this talk, we will look at Mahout-Samsara's distributed linear algebra capabilities and demonstrate the same by building a classification algorithm for the popular 'Eigenfaces' problem using the Samsara DSL from an Apache Zeppelin notebook. We will demonstrate how a simple classification algorithm may be prototyped and executed, and show the performance using Samsara DSL with GPU acceleration. This will demonstrate how ML algorithms built with Samsara DSL are automatically parallelized and optimized to execute on Apache Flink and Apache Spark without the developer having to deal with the underlying semantics of the execution engine.

Speakers

Suneel Marthi

Principal Engineer, Red Hat
Suneel Marthi is a member of the Apache Software Foundation, and a Committer and PMC member on Apache Mahout, Apache PredictionIO and Apache Pirk. He has previously presented at Apache Big Data, Hadoop Summit Europe and Flink Forward.


Tuesday November 15, 2016 11:00 - 11:50
Santa Cruz

11:00

Smart Manufacturing with Apache Spark Streaming and Deep Learning - Prajod Vettiyattil, Wipro Technologies
Even after a century of the Industrial Revolution, manufacturing processes, even within assembly lines, involve manual steps requiring costly human intervention, e.g. product quality inspection. With the advent of machine learning and big data tools, it has become possible to automate many of these manual processes. What is more, such solutions can surpass human capability in manual quality inspection. In this session we will look at a few examples of how products on assembly lines can be monitored for quality, using image processing techniques combined with machine learning. The solution to be presented is built using a combination of machine learning and deep learning techniques running on Apache Spark Streaming.

The presentation will also explain the steps involved in creating such a solution: mapping a business need to an ML-based technical solution.

Speakers

Prajod Vettiyattil

Architect, Wipro Technologies
Prajod is a senior Architect with 20 years of experience in the open source group of Wipro, responsible for research and solution development in the area of Big Data and Analytics. He has presented at multiple open source conferences (Open Source India Days, Great Indian Developer Summit, JUDCon, WSO2Con). He has also written articles on technology in online forums and print magazines. See his LinkedIn page for presentation and article... Read More →


Tuesday November 15, 2016 11:00 - 11:50
Giralda VI/VII

11:00

Introducing Apache CouchDB 2.0 - Jan Lehnardt, Neighbourhoodie Software
A thorough introduction to CouchDB 2.0, the five-years-in-the-making final delivery of the larger CouchDB vision.



Apache CouchDB 2.0 finally puts the C back in CouchDB: a Cluster Of Unreliable Commodity Hardware. With a production-proven implementation of the Amazon Dynamo paper, CouchDB now has high-availability, multi-machine clustering as well as scaling options built in, making it ready for Big Data solutions that benefit from CouchDB's unique multi-master replication.


Speakers

Jan Lehnardt

CEO, Neighbourhoodie Software
Jan Lehnardt is the PMC Chair and VP of Apache CouchDB, co-creator of the Hoodie web app framework based on CouchDB as well as the founder and CEO of Neighbourhoodie Software. He’s the longest standing contributor to Apache CouchDB.


Tuesday November 15, 2016 11:00 - 11:50
Nervion/Arenal II/III

11:00

A Java Implementer's Guide to Boosting Apache Spark Performance - Tim Ellison, IBM
Apache Spark has rocked the big data landscape, becoming the largest open source big data community with over 750 contributors from more than 200 organizations. Spark's core tenets of speed, ease of use, and a unified programming model fit neatly with the high performance, scalable, and manageable characteristics of modern Java runtimes. In this talk Tim Ellison, a JVM developer at IBM, shows some of the unique Java 8 capabilities in the JIT compiler, fast networking, serialization techniques, and GPU off-loading that deliver the ultimate big data platform for solving business problems. Tim will demonstrate how solutions, previously infeasible with regular Java programming, become possible with this high performance Spark core runtime, enabling you to solve problems smarter and faster.

Speakers

Tim Ellison

Tim Ellison is currently a Senior Technical Staff Member with IBM's Java Technology Centre in the UK. He has worldwide responsibility for Open Source Engineering in the Java SDK underpinning a broad selection of IBM's flagship products. He is a Member of the Apache Software Foundation, a PMC member of Apache Pirk, and has been a Vice President of the Apache Software Foundation and chair of the Apache Harmony Project. Tim is an emeritus team... Read More →


Tuesday November 15, 2016 11:00 - 11:50
Giralda I/II

12:00

Apache Ignite - Path to Converged Data Platform - Dmitriy Setrakyan, GridGain
Apache Ignite is one of the fastest growing Apache projects. The presentation will take the audience on a roadmap discovery of Ignite moving to a converged storage model, supporting both analytical and transactional data sets. We will go over the differences between Fast Data and Big Data and cover the projects supporting both technologies. We will discuss the reasons, real-life use cases and technology approaches for merging Fast Data and Big Data in order to deliver a consistent & universal data processing platform, regardless of whether data resides on HDD, flash or DRAM.

Speakers

Dmitriy Setrakyan

EVP Engineering, GridGain
Dmitriy Setrakyan is founder and Chief Product Officer at GridGain. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior to GridGain, Dmitriy worked at eBay where he was responsible for the architecture of an ad-serving system processing several billion hits a day. Currently Dmitriy... Read More →


Tuesday November 15, 2016 12:00 - 12:50
Giralda V

12:00

Apache Spark: Enterprise Security for Production Deployments - Vinay Shukla, Hortonworks
Spark is being deployed in production by many enterprises. With enterprise traction come enterprise security requirements and the need to meet enterprise security standards.



The session walks through enterprise security requirements, provides a deep dive into Spark security features, and shows how Spark meets these enterprise security requirements.



The talk covers the entire gamut of security in Spark, from Kerberos, authentication, authorization and audit to encryption. The session will provide a deep dive on all existing security features in Spark and will also outline future security work planned in the Apache Spark community.

Speakers

Vinay Shukla

Director, Product Management, Hortonworks
Vinay Shukla is the Director of Product Management for Spark & Zeppelin at Hortonworks. Previously, Vinay has worked as Developer and Security Architect. Vinay has given talks at Hadoop Summit (2x), Apache Con Big Data - Europe (2015), JavaOne & Oracleworld. His most recent talk was at Hadoop Summit Dublin (April, 2016) -Running Spark in Production. Please see video of that talk at (https://youtu.be/OkyRdKahMpk) and slides at... Read More →


Tuesday November 15, 2016 12:00 - 12:50
Nervion/Arenal I

12:00

Create a Hadoop Cluster and Migrate 39PB Data Plus 150000 Jobs/Day - Stuart Pook, Criteo
Criteo had a Hadoop cluster with 39 PB of raw storage, 13404 CPUs, 105 TB of RAM, 40 TB of data imported per day and more than 100000 jobs per day. This cluster was critical for both storage and compute, but had no backups. This talk describes:
  • the different options considered when deciding how to protect our data and compute capacity
  • the criteria established for the 800 new computers and the comparison tests between suppliers' hardware
  • the non-blocking network infrastructure with 10 Gb/s endpoints, scalable to 5000 machines
  • the installation and configuration, using Chef, of a cluster on new hardware
  • the problems encountered in moving our jobs and data from the old CDH4 cluster to the new CDH5 cluster 600 km away
  • running both clusters in parallel and feeding them with data
  • fail-over plans
  • operational issues
  • the performance of the 16800-core, 200 TB RAM, 60 PB disk CDH5 cluster.

Speakers

Stuart Pook

Senior DevOps Engineer, Criteo
Stuart loves storage (100 PB at Criteo) and is part of Criteo's Lake team that runs some small and two huge Hadoop clusters. He also loves automation with Chef because configuring 2000 Hadoop nodes by hand is just too slow. He has spoken at Devoxx2016 and at Criteo Lab's NABD Conference 2016. Before discovering Hadoop he developed user interfaces for biotech companies.



Tuesday November 15, 2016 12:00 - 12:50
Giralda VI/VII

12:00

The Original Vision of Nutch, 14 Years Later: Building an Open Source Search Engine - Sylvain Zimmer, Common Search
Few people remember that before spinning off Hadoop and focusing on crawling, Nutch was meant to be an alternative to commercial search engines. What if we tried to do it again today?



In this presentation, Sylvain Zimmer will explain how he used projects from the Nutch diaspora like Spark and Elasticsearch to build Common Search, an open source search engine with transparent rankings.



We will go over the architecture of large-scale search engines and how it has evolved since the late 90s. Then we will review the tools from the Apache and open source ecosystems that are best suited to solve the many challenges at hand. Finally, we will discuss what lies ahead for Common Search before it can be useful to the general public.

Speakers

Sylvain Zimmer

Founder, Common Search
Sylvain Zimmer is a software developer and longtime free culture advocate. In 2004 he founded Jamendo, the largest Creative Commons music community online. Since 2012, he has been the CTO of Pricing Assistant, a startup specialized in large-scale crawling of E-commerce websites. He is also the founder and main curator of dotConferences, a series of TED-like developer events in Paris. More recently, he started Common Search, an ambitious... Read More →


Tuesday November 15, 2016 12:00 - 12:50
Giralda III/IV

12:00

Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Other NoSQL Data Systems - Christian Tzolov, Pivotal
When working with BigData & IoT systems we often feel the need for a common query language. System-specific languages usually require longer adoption time and are harder to integrate within existing stacks.

To fill this gap, some NoSQL vendors are building SQL access to their systems. Building a SQL engine from scratch is a daunting job, and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allows you to integrate a SQL parser, a cost-based optimizer, and JDBC with your NoSQL system.

We will walk through the process of building a SQL access layer for Apache Geode (an in-memory data grid). I will share my experience, pitfalls and technical considerations, such as balancing between SQL/RDBMS semantics and the design choices and limitations of the data system.

Hopefully this will enable you to add SQL capabilities to your preferred NoSQL data system.

Speakers

Christian Tzolov

Pivotal Inc
Christian Tzolov is a Pivotal technical architect and a BigData and Hadoop specialist, contributing to various open source projects. In addition to being an Apache® Committer and Apache Crunch PMC Member, he has spent over a decade working with various Java and Spring projects and has led several enterprises on large scale artificial intelligence, data science, and Apache Hadoop® projects. twitter: @christzolov blog: http://blog.tzolov.net



Tuesday November 15, 2016 12:00 - 12:50
Nervion/Arenal II/III

12:00

Using Apache Spark for Generating ElasticSearch Indices Offline - Andrej Babolcai, ESET
Making historical data available for searching can be a challenge, especially if you have a lot of it. Indexing data to a live cluster can degrade search performance, and having a spare cluster where you index your data can be expensive. In this talk we present the approaches we tried and describe an approach to creating ElasticSearch indices offline using Apache Spark. Once created, these indices are stored as snapshots in HDFS and can then be restored to a running ElasticSearch cluster. Snapshots in HDFS also serve as a backup, ready-to-restore solution in case of an error.

Speakers

Andrej Babolcai

Software Engineer, Eset
Software Engineer at ESET, currently working with Big Data technologies. Responsible for collecting and storing data and making it available to end users. Previously worked at Honeywell. Speaking experience: CARO Workshop 2016 (http://2016.caro.org/).


Tuesday November 15, 2016 12:00 - 12:50
Giralda I/II

12:00

Building Streaming Applications with Apache Apex - Thomas Weise & Chinmay Kolhatkar, DataTorrent
Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file-based sources, or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases, etc.). Apex ships with the Malhar library, a wide range of connectors and other operators that are readily available for building applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead log, incremental state saving, windowing, etc.) and APIs for application specification.

Speakers

Chinmay Kolhatkar

Chinmay is a Software Engineer at DataTorrent Software, India, and a committer on the Apache Apex project.

Thomas Weise

Architect, DataTorrent
Thomas is the Apache Apex PMC chair and an architect/co-founder at DataTorrent. He has developed distributed systems, middleware and web applications since 1997. Prior to DataTorrent he was on the Hadoop team at Yahoo! and contributed to several of the ecosystem projects.



Tuesday November 15, 2016 12:00 - 12:50
Carmona

13:00

Large Scale Open Source Data Processing Pipelines at Trivago - Clemens Valiente, Trivago
trivago processes roughly 7 billion events per day with an architecture that is entirely open source, from producing the data to visualizing it in dashboards and reports. This talk will explain the idea behind the pipeline, highlight a particular business use case, and share the experience and engineering challenges from two years in production. Clemens Valiente will furthermore show the different tools, frameworks and systems used, with a main focus on Kafka for data ingestion, Hadoop and Hive for processing, and Impala for querying. The successful implementation of this large-scale data processing pipeline fundamentally transformed the way trivago approaches its business.

Speakers

Clemens Valiente

Lead Data Engineer, trivago GmbH
I'm part of trivago's Data Engineering team, where we run a data processing pipeline through Kafka, Hadoop, Impala and R, processing roughly 7 billion events per day. Our Hadoop cluster is central to BI dashboards, reports, ad hoc analyses, personalisation, bidding and recommendation algorithms, as well as our invoicing.


Tuesday November 15, 2016 13:00 - 13:50
Giralda VI/VII

13:00

Massively Parallel Data Warehousing in the Hadoop Stack - Gregory Chase & Roman Shaposhnik, Pivotal
Hadoop has been touted as a replacement for data warehouses. In practice, Hadoop has had success offloading ETL/ELT workloads, but it still has gaps in serving requirements for operational analytics.

Apache Bigtop now includes Greenplum Database in its deployment of big data solutions. Greenplum Database, an open source massively parallel data warehouse based on PostgreSQL, is an excellent addition to the Hadoop ecosystem.

In this session we'll cover:
  • Introduction to Greenplum 
  • Bigtop Support for Greenplum
  • External tables in Hadoop by Greenplum
  • Parallel reads and writes to Hadoop by Greenplum
  • Running advanced analytics on structured and unstructured data in both Hadoop and Greenplum via Apache MADlib (incubating)
  • Geospatial and Machine Learning in Greenplum based on HDFS data
  • Storing data from a data lake in Greenplum for high throughput analytical queries

Speakers

Gregory Chase

Director of Big Data Communities, Pivotal Software
Greg Chase is an enterprise software marketing executive with more than 20 years of experience in marketing, sales, and engineering with software companies. Most recently Greg has been passionately advocating for innovation and transformation of business and IT practices through big data, cloud computing, and business process management in his role as Director of Product Marketing at Pivotal Software. Greg is also a wine maker, dog lover, community... Read More →

Roman Shaposhnik

Director of Open Source, Pivotal Inc.
Roman Shaposhnik is a Director of Open Source at Pivotal Inc. He is a committer on Apache Hadoop, co-creator of Apache Bigtop and contributor to various other Hadoop ecosystem projects. He is also an ASF member and a former Chair of Apache Incubator. In his copious free time he managed to co-author "Practical Graph Analytics with Apache Giraph" and he also posts to twitter as @rhatr. Roman has been involved in Open Source software for more than a... Read More →


Tuesday November 15, 2016 13:00 - 13:50
Nervion/Arenal I

13:00

What 50 Years of Data Science Leaves Out - Sean Owen, Cloudera

We're told "data science" is the key to unlocking the value in big data, but nobody seems to agree on just what it is: engineering, statistics, both? David Donoho's paper "50 Years of Data Science" offers one of the best criticisms of the hype around data science from a statistics perspective, and proposes that data science is not new, if it is anything at all. This talk will examine these points and respond with an engineer's counterpoints, in search of a better understanding of data science.


Speakers

Sean Owen

Director of Data Science, Cloudera
Sean is Director of Data Science, based in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time learning on Hadoop. He is on the Apache Spark PMC and co-author of Advanced Analytics with Spark. Previously, Sean was a senior engineer at Google. He holds an MBA from London Business School and a BA from Harvard University.


Tuesday November 15, 2016 13:00 - 13:50
Estepa

13:00

Scalable Private Information Retrieval: Introducing Apache Pirk (incubating) - Ellison Anne Williams, Creator of Apache Pirk
Querying information over terabytes of data where no one can see what you query or the responses you obtain? It sounds like science fiction, but it is actually the science of Private Information Retrieval (PIR). This talk will introduce Apache Pirk, a new incubating Apache project designed to provide a framework for scalable, distributed PIR. We will discuss the motivation for Apache Pirk, its distributed implementations on platforms such as Spark and Storm, its current algorithms, the power of homomorphic encryption, and take a look at the path forward.
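To make the idea of PIR concrete, here is the classic two-server XOR scheme as a toy sketch (a textbook simplification for intuition only; Apache Pirk's actual algorithms are based on homomorphic encryption, not on this scheme). Each server sees only a uniformly random index set, so neither server alone learns which record the client wants.

```python
import secrets

# Toy two-server information-theoretic PIR. Each server XORs together the
# records the client asks for; the client combines the two answers.

def server_answer(database, index_set):
    """What each (non-colluding) server computes: XOR of requested records."""
    answer = 0
    for i in index_set:
        answer ^= database[i]
    return answer

def pir_query(database, wanted):
    """Retrieve database[wanted] without revealing `wanted` to either server."""
    n = len(database)
    # A uniformly random subset of indices goes to server 1 ...
    s1 = {i for i in range(n) if secrets.randbits(1)}
    # ... and the same subset with the wanted index toggled goes to server 2.
    s2 = s1 ^ {wanted}
    # XOR of the two answers cancels every shared record, leaving only
    # the record at the wanted index.
    return server_answer(database, s1) ^ server_answer(database, s2)

db = [17, 42, 99, 7]
print(pir_query(db, 2))  # 99
```

Both queries individually are uniformly random sets, which is what makes the retrieval private; scaling this style of computation to terabytes on Spark or Storm is exactly the engineering problem Pirk targets.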

Speakers

Ellison Anne Williams

Ellison Anne Williams is a creator and PMC member of Apache Pirk, a pure mathematician by training, and a practical computer scientist in real life. Her passion is doing cool stuff with massive amounts of data.


Tuesday November 15, 2016 13:00 - 13:50
Carmona

13:00

SASI, Cassandra on the Full Text Search Ride! - DuyHai Doan, Datastax
Apache Cassandra is a scalable database with high-availability features, but these come with severe limitations in terms of querying capabilities.



Since the introduction of SASI in Cassandra 3.4, those limitations belong to the past. Now you can create indices on your columns as well as benefit from full-text search capabilities with the introduction of the new `LIKE '%term%'` syntax.



To illustrate how SASI works, we'll use a database of 100 000 albums and artists. We'll also show how SASI can help accelerate analytics scenarios with Apache Spark using SparkSQL predicate push-down.



We will also highlight some use cases where SASI is not a good fit and should be avoided (there is no magic, sorry).


Speakers

DuyHai Doan

Technical Advocate, Datastax
DuyHai Doan is an Apache Cassandra evangelist and Apache Zeppelin committer. He splits his time between technical presentations/meetups on Cassandra, coding on open source projects to support the community, and helping companies using Cassandra to make their projects successful. He is also interested in the ecosystem around Cassandra (Spark, Zeppelin, ...). Previously, he worked as a freelance Java/Cassandra consultant.


Tuesday November 15, 2016 13:00 - 13:50
Nervion/Arenal II/III

13:00

Elastic Spark Programming Framework - A Dependency Injection Based Programming Framework for Spark Applications - Bruce Kuo
Apache Spark is one of the hottest computing engines nowadays, and more people in the big data industry are starting to use it in production systems, such as machine learning and data ETL applications. However, developers take code quality seriously in production systems, and in our experience there is a gap between development and production. The difficulties we encounter when developing Spark applications for production systems are: (1) components are hard to communicate with, (2) application arguments are managed indirectly, and (3) code maintainability is inadequate.

To solve these problems and make development smoother, we propose a dependency-injection-based programming framework for JVM systems. It provides basic management, monitoring, and better communication mechanisms. This flexibility helps developers write Spark applications and integrate with other components in a better manner.
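To make the pattern concrete, here is a minimal, hypothetical sketch of dependency injection applied to application wiring. This is not the proposed framework's actual API, only an illustration of the idea the abstract describes: components are registered once and injected by name, so application code never constructs its own dependencies.

```python
class Container:
    """Tiny DI container: register factories, resolve lazy singletons."""
    def __init__(self):
        self._factories = {}
        self._instances = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def resolve(self, name):
        if name not in self._instances:          # build once, reuse after
            self._instances[name] = self._factories[name](self)
        return self._instances[name]


class ArgParser:
    """Centralizes application arguments (problem (2) in the abstract)."""
    def __init__(self, args):
        self.args = dict(kv.split("=", 1) for kv in args)


class EtlJob:
    """A job receives its dependencies instead of creating them."""
    def __init__(self, arg_parser):
        self.input_path = arg_parser.args["input"]


container = Container()
container.register("args", lambda c: ArgParser(["input=/data/events"]))
container.register("job", lambda c: EtlJob(c.resolve("args")))

job = container.resolve("job")   # job.input_path == "/data/events"
```

Because the container owns construction, a test can swap in a fake `ArgParser` without touching the job code, which is exactly the maintainability gain the abstract argues for.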

Speakers
avatar for Bruce Kuo

Bruce Kuo

Software Engineer, Yahoo!
Chun-Ting Kuo (Bruce) works at Yahoo as a data engineer, where he develops data products and scientific applications. His experience covers Spark, Hadoop, algorithms, and a little machine learning. In his free time, he loves to code and learn novel techniques.


Tuesday November 15, 2016 13:00 - 13:50
Giralda I/II

13:00

Power Pig with Spark - Liyun Zhang, Intel
Apache Pig is a popular scripting platform for processing and analyzing large data sets in the Hadoop ecosystem. With its open architecture and backend neutrality, Pig scripts can currently run on MapReduce and Tez. Apache Spark is an open-source data analytics cluster computing framework that has gained significant momentum recently. Besides offering performance advantages, Spark is also a more natural fit for the query plan produced by Pig. Pig on Spark enables improved ETL performance while also supporting users intending to standardize on Spark as the execution engine.

Speakers
LZ

Liyun Zhang

Software Engineer, Intel
Liyun Zhang is a Software Engineer at Intel. She is one of the main contributors to the Pig on Spark project. Prior to that, she made several contributions to the Intel Distribution for Hadoop.


Tuesday November 15, 2016 13:00 - 13:50
Giralda V

13:00

Sparkler - Crawler on Apache Spark - Karanjeet Singh & Thamme Gowda, USC
A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines, knowledge bases, etc. In this presentation, Karanjeet Singh and Thamme Gowda will describe a new crawler called Sparkler (contraction of Spark-Crawler) that makes use of recent advancements in the distributed computing and information retrieval domains by bringing together various Apache projects: Spark, Kafka, Lucene/Solr, Tika, and Felix. Sparkler is an extensible, highly scalable, high-performance web crawler that is an evolution of Apache Nutch and runs on an Apache Spark cluster. GitHub Link - https://github.com/USCDataScience/sparkler

Speakers
avatar for Thamme Gowda

Thamme Gowda

Graduate Student, University of Southern California
Thamme Gowda is a grad student at the Univ. of Southern California, Los Angeles, CA, and also an intern at NASA Jet Propulsion Laboratory, Pasadena, CA, USA. He is a co-founder of Datoin.com, a software-as-a-service platform built using Hadoop and Spark. He is also a committer and PMC member of Apache Tika and Apache Nutch, and on the Podling PMC of Apache Joshua (Incubating).
avatar for Karanjeet Singh

Karanjeet Singh

Research Assistant, University of Southern California
Karanjeet Singh is pursuing his Master's degree in Computer Science at the University of Southern California (USC). His projects and research are mostly in the areas of Information Retrieval and Data Science. He is also affiliated with NASA Jet Propulsion Laboratory. Prior to this, he worked at Computer Sciences Corporation (CSC) as a web developer for a U.S.-based financial firm.



Tuesday November 15, 2016 13:00 - 13:50
Giralda III/IV

13:50

14:00

Women in Big Data Luncheon & Program
Limited capacity; seats available.

On behalf of Women in Big Data, we'd like to invite you to a luncheon/meetup event taking place Tuesday, November 15.

This luncheon is open to women and allies who are interested in attending to network and collaborate with other like-minded individuals, with the ultimate goal of strengthening and increasing diversity in the big data community. The luncheon is free to attend, but space is limited. Please RSVP here if you'd like to attend.


Luncheon Agenda



  • 1:50pm - WiBD Overview – Anna Marchon

  • 2:00pm - Keynote: Tina Rosario, Global VP, Enterprise Data Management at SAP 

  • 2:30pm - Keynote: Marina Alekseeva, GM of the Intel Software and Service Group in Russia

  • 3:00pm - Networking


 

 A big thank you to our lunch sponsor, Women in Big Data. For more details on WiBD and how to get involved, visit https://www.womeninbigdata.org/

Speakers
avatar for Marina Alekseeva

Marina Alekseeva

Director of Software Product Services, Intel SSG
Marina Alekseeva is the General Manager of the Intel Software and Service Group (SSG) in Russia and the Director of Software Product Services, a multinational multifunctional team which provides complete solutions and infrastructure for software production and product delivery. Member of Intel EMEA Diversity and Inclusion Board, active mentor and training instructor. She started her professional career as a software developer at Russian...
avatar for Tina Rosario

Tina Rosario

Tina Rosario - Global Vice President, Enterprise Data Management, SAP
Tina Rosario is a business strategy professional with over 25 years of experience in IT, business process re-engineering, change management and enterprise data management. During her 12 years at SAP, Tina has held executive positions in business operations, consulting services and corporate strategy. Her expertise ranges from building best practice information governance programs, driving data technology development and managing teams of...


Tuesday November 15, 2016 14:00 - 15:10
Santa Cruz

15:30

Low Latency Web Crawling on Apache Storm - Julien Nioche, DigitalPebble Ltd.
StormCrawler is an open source collection of resources, mostly implemented in Java, for building low-latency, scalable web crawlers on Apache Storm. After a short introduction to Apache Storm and an overview of what StormCrawler provides, we will compare it with similar projects like Apache Nutch and present several real life use cases. In particular we will see how StormCrawler can be used with ElasticSearch and Kibana for crawling and indexing web pages and also monitor the crawl itself.

Speakers
avatar for Julien Nioche

Julien Nioche

Director, DigitalPebble Ltd
I run DigitalPebble Ltd, a consultancy based in Bristol, UK and specialising in open source solutions for text engineering. My expertise covers web crawling, natural language processing, machine learning and search. I am a committer on Apache Nutch and am also involved in several other open source projects, including StormCrawler and Behemoth. I gave talks at conferences such as ApacheCon, BerlinBuzzwords and LuceneSOLRRevolution.


Tuesday November 15, 2016 15:30 - 16:20
Giralda III/IV

15:30

Your Datascience Journey with Apache Zeppelin - Moon soo Lee, Anthony Corbacho & Jongyoul Lee, NFLabs
Take a journey with us to see how Apache Zeppelin started, how it helps your data science lifecycle, and how it became a popular TLP project. We'll also see how the community's focus has shifted, from basic notebook features and Spark integration to advanced features like multi-tenancy. Moon soo Lee will explain the value of Apache Zeppelin with some key use-case scenario demos. We'll also look at the ecosystem around it: how various projects and companies use Apache Zeppelin in their products and services in many different ways.

Finally, we'll discuss Apache Zeppelin's future roadmap and some challenges the community faces.

Speakers
avatar for Moon

Moon

CTO, NFLabs
Moon soo Lee is a creator of Apache Zeppelin and a co-founder and CTO at NFLabs. For the past few years he has been bootstrapping the Zeppelin project and its community. His recent focus is growing the Zeppelin community and driving adoption.
avatar for Jongyoul Lee

Jongyoul Lee

Software Development Engineer, NFLabs
Jongyoul Lee is a PMC member of Apache Zeppelin and works at NFLabs. In Apache Zeppelin, he focuses on stabilizing it for production use, developing enterprise features, and enhancing the Spark/JDBC features. Personally, he is really interested in distributed, fault-tolerant systems.


Tuesday November 15, 2016 15:30 - 16:20
Carmona

15:30

AMIDST Toolbox: A Java Toolbox for Scalable Probabilistic Machine Learning - Andres Masegosa, NTNU
We would like to present our open source AMIDST toolbox for analysis of large-scale data sets using probabilistic machine learning models. AMIDST runs algorithms in a distributed fashion for learning a wide range of latent variable models such as Gaussian mixtures, (probabilistic) principal component analysis, Hidden Markov Models, Kalman Filter, Latent Dirichlet Allocation, etc. This toolbox is able to learn any user-defined probabilistic (graphical) model with billions of nodes using novel message passing algorithms.



We plan to give an overview of the AMIDST toolbox, some details about the API and the integration with Flink, Spark (and other open source tools), and an analysis of the scalability of our learning algorithms. All this in the context of a real use case scenario in the financial domain (BCC group), where millions of customer profiles are analyzed.

Speakers
avatar for Andres Masegosa

Andres Masegosa

Phd, NTNU
I am a research fellow at NTNU (Norway) with broad interests in data mining and machine learning using probabilistic graphical models. Lately, my research has focused on scalable machine learning methods for solving real use cases in the financial (BCC group) and automotive (Daimler group) industries. I am co-author of more than 50 scientific papers in journals and international conferences covering applied areas such as bioinformatics...


Tuesday November 15, 2016 15:30 - 16:20
Estepa

15:30

Classifying Unstructured Text - Deterministic and Machine Learning Approaches - Christian Winkler & Stephanie Fischer, mgm Technology Partners GmbH
Text is one of the most used forms of communication and is ubiquitous on the Internet. Social networks like Facebook and Twitter mainly contain unstructured text; the same is true for content-driven websites.



For humans it is easy to grasp the meaning of text - much more difficult for computers. Used correctly, computers can help humans tremendously in structuring and classifying huge amounts of text. This "symbiosis" can help humans work more efficiently, reduce repetitive work and use the uncovered structure.



Our talk starts with visualizations that give us ideas of how to automatically classify texts. Then we will demonstrate that manual intervention is sometimes necessary and how it can be used as a basis for machine learning. This helps significantly in classifying more complicated cases.



As software tools we use R, Apache Solr, D3.js, and several NLP and ML tools from the ASF.

Speakers
avatar for Stephanie Fischer

Stephanie Fischer

Big Data, Agile and Change Management, mgm consulting partners
I concentrate on user-centricity of Big Data technologies. My focus is finding the questions really worth solving. I think Big Data has the potential to advance humanity into a desirable direction. I have a background in organizational development, agility and business analytics.
avatar for Christian Winkler

Christian Winkler

Enterprise architect, mgm technology partners GmbH
Christian has worked with Internet technologies for 20 years. Recently, he has focused on working with large amounts of data or many users. As big data applications become more and more popular, many aggregates have to be calculated to describe characteristics of data sets. This is why he concentrates on intelligent algorithms like machine learning to find those interpretations. Often he uses sophisticated...


Tuesday November 15, 2016 15:30 - 16:20
Giralda V

15:30

User Defined Functions and Materialized Views in Cassandra 3.0 - DuyHai Doan, Datastax
Cassandra is evolving at a very fast pace and keeps introducing new features that close the gap with the traditional SQL world, but they are always designed with a distributed approach in mind.



First we'll take a look at the recent user-defined functions and show how they can improve your application performance and enrich your analytics use cases.



Next, a tour of materialized views, a major improvement that drastically changes the way people model data in Cassandra and makes developers' lives easier!
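As a hedged sketch of the two CQL features the talk covers (the function, table, and view names below are hypothetical), a user-defined function and a materialized view look like this:

```python
# CQL sketches for Cassandra 3.0 features; names are hypothetical and
# the statements would be run through any CQL driver or cqlsh.

create_udf = """
CREATE FUNCTION IF NOT EXISTS fahrenheit_to_celsius (temp double)
RETURNS NULL ON NULL INPUT
RETURNS double
LANGUAGE java
AS 'return (temp - 32) / 1.8;';
"""

# A materialized view serves queries by city without a hand-maintained
# denormalized table; Cassandra keeps it in sync with the base table.
# Every base-table primary key column must appear, guarded by IS NOT NULL.
create_mv = """
CREATE MATERIALIZED VIEW users_by_city AS
SELECT * FROM users
WHERE city IS NOT NULL AND id IS NOT NULL
PRIMARY KEY (city, id);
"""
```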

Speakers
avatar for DuyHai Doan

DuyHai Doan

Technical Advocate, Datastax
DuyHai Doan is an Apache Cassandra evangelist and Apache Zeppelin committer. He splits his time between technical presentations/meetups on Cassandra, coding on open source projects to support the community, and helping companies using Cassandra to make their projects successful. He is also interested in the ecosystem around Cassandra (Spark, Zeppelin, ...). Previously, he worked as a freelance Java/Cassandra consultant.


Tuesday November 15, 2016 15:30 - 16:20
Nervion/Arenal II/III

15:30

Building and Running a Solr-as-a-Service for IBM Watson - Shai Erera, IBM
Running a managed Solr service brings fun challenges, for both the users and the service itself. Users typically do not have access to all components of the Solr system (e.g. the ZK ensemble, the actual nodes that Solr runs on, etc.). On the other hand, the service must ensure high availability at all times, and handle what are often user-driven tasks such as version upgrades, taking nodes offline for maintenance, and more.



In this talk I will describe how we tackled these challenges to build a managed Solr service on the cloud, which currently hosts a few thousand Solr clusters. I will focus on the infrastructure that we chose to run the Solr clusters on, as well as how we ensure high availability, cluster balancing and version upgrades.

Speakers
avatar for Shai Erera

Shai Erera

STSM, Social Analytics & Technologies, IBM
Shai Erera is a Researcher at IBM Research, Haifa, Israel. Shai earned his M.Sc in Computer Science from the University of Haifa in 2007. Shai’s work experience includes the development of search-based systems over Lucene and Solr and he is also a Lucene/Solr committer.


Tuesday November 15, 2016 15:30 - 16:20
Giralda VI/VII

15:30

Getting Started Contributing to Apache Spark - Holden Karau, IBM
Apache Spark is one of the most popular tools for big data, and with over 400 open pull requests as of this writing, very active in terms of development as well. With such a large volume of contributions, it can feel difficult to start contributing to Apache Spark. This talk is developer focused and will walk through how to find good issues to start with, formatting code, finding reviewers, and what to expect in the code review process. We will also talk about alternatives to contributing to Apache Spark directly (such as creating packages).

Speakers
avatar for Holden Karau

Holden Karau

Principal Software Engineer, IBM
Holden Karau is a software development engineer and is active in open source. She is a co-author of Learning Spark & Fast Data Processing with Spark and has taught intro Spark workshops. Prior to IBM she worked on a variety of big data, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of computers she...


Tuesday November 15, 2016 15:30 - 16:20
Giralda I/II

15:30

Implementing BigPetStore in Spark and Flink - Márton Balassi, Cloudera
Implementing use cases on unified data platforms. Having a unified data processing engine empowers Big Data application developers, as it makes connections between seemingly unrelated use cases natural. This talk discusses the implementation of the so-called BigPetStore project (which is part of Apache Bigtop) in Apache Spark and Apache Flink. The aim of BigPetStore is to provide a common suite to test and benchmark Big Data installations. The talk features best practices and implementations with the batch, streaming, SQL, DataFrames and machine learning APIs of Apache Spark and Apache Flink side by side. A range of use cases are outlined in both systems, from data generation through ETL and recommender systems to online prediction.

Speakers
avatar for Márton Balassi

Márton Balassi

Solutions Architect, Cloudera
Márton Balassi is a Solutions Architect at Cloudera and a PMC member at Apache Flink. He focuses on Big Data application development, especially in the streaming space. Márton is a regular contributor to open source and has spoken at a number of Big Data conferences and meetups, including ApacheCon, Hadoop Summit and numerous Big Data...


Tuesday November 15, 2016 15:30 - 16:20
Nervion/Arenal I

16:30

Unified Benchmarking of Big Data Platforms - Axel-Cyrille Ngonga Ngomo, INFAI
Which Big Data platform should I use for my problem? This remains one of the most important questions for practitioners. In this talk, we will present HOBBIT (http://project-hobbit.eu), a universal benchmarking platform for Big Data. The platform provides a unified approach to benchmarking Big Data frameworks. Mimicking algorithms generated from real data ensure that the datasets used for benchmarking resemble real data but are open for all to use, thereby circumventing the issues that come with using company-bound data. The core of the platform implements industry-relevant KPIs gathered from more than 70 Big-Data-driven organizations. The results are generated in machine-readable formats so as to ensure that they can be analyzed and used for improving tools and frameworks. In the talk, I will present the architecture of the framework and some preliminary results.

Speakers
avatar for Axel-Cyrille Ngonga Ngomo

Axel-Cyrille Ngonga Ngomo

Head of Research Group, INFAI
Head of AKSW (http://aksw.org) at University of Leipzig/InfAI, a research group with ca. 50 members. Author of 120+ research papers and 20+ presentations at top-tier conferences. Recipient of numerous research awards, including the Next Einstein Forum award 2016, 12 best research paper awards and competition wins. Coached by Lisa Shufro (ex-TED coach) in speaking. Currently head of the HOBBIT project (http://project-hobbit.eu), which focuses on...


Tuesday November 15, 2016 16:30 - 17:20
Giralda VI/VII

16:30

Apache Sentry - High Availability - Sravya Tirukkovalur & Hao Hao, Cloudera
As big data continues to get bigger, deploying flexible and robust security is more important than ever. In this talk, we'll discuss Apache Sentry, a central service for policy management, and its various pluggable authorization engines, which integrate with many Hadoop components. We will dive deep into how its latest design allows for fault tolerance, high availability and scalability.



Unlike in traditional database systems, authorization in the Hadoop ecosystem is a tricky problem because there are multiple doors to the same data. Sentry provides a great deal of usability by letting users define policies once, replicating the state as necessary. With this come the additional challenges of designing a distributed service that manages consistent state. This talk will touch upon the core design choices that are the building blocks of any robust distributed system.

Speakers
avatar for Hao Hao

Hao Hao

Software Engineer, Cloudera Inc
Hao Hao is a software engineer at Cloudera. She is an active committer and a PMC member of the Apache Sentry project. Hao performed extensive research on smartphone and web security while she was a PhD student at Syracuse University. Prior to joining Cloudera, Hao worked on eBay’s Search Backend team to build search infrastructure for eBay’s online buying platform. See www.linkedin.com/in/hao-hao
avatar for Sravya Tirukkovalur

Sravya Tirukkovalur

Software Engineer, Cloudera
Sravya Tirukkovalur is a software engineer at Cloudera working on Hadoop security. She is one of the active contributors to the Apache Sentry project and also the PMC Chair. She got her Masters degree from The Ohio State University, with her research focus on High performance and Distributed computing. She is passionate about social impact through technology and volunteers outside of her day job. See...


Tuesday November 15, 2016 16:30 - 17:20
Nervion/Arenal I

16:30

Meerkat: Anomaly Detection as a Service - Julien Herzen, Swisscom
Julien will present Meerkat, a system built at Swisscom for real-time anomaly detection on time series. Meerkat uses a combination of machine learning and big data technologies to trigger alerts in case of problems in Swisscom's network.

Meerkat monitors arbitrary time series and trains statistical models that can be used to spot anomalies in both batch (historical) and streaming (live) data. It is composed of Python modules for anomaly detection and data ingestion from Druid, as well as Scala modules using Apache Spark for ingesting from Apache Kafka and Apache Hadoop's HDFS.

Meerkat is currently used successfully at Swisscom to trigger alerts in case of problems with VoIP calls, which represent more than 3 million phone calls per day.

This is joint work with Khue Vu, who worked on Meerkat for his MSc thesis at EPFL, and the network intelligence team of Swisscom Innovation.
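Meerkat's internal models are not described in this abstract; as an illustrative stand-in, the sketch below shows the simplest kind of statistical check such a system might train per time series: flag any point that deviates more than a few standard deviations from a trailing window (the metric name and thresholds are hypothetical).

```python
# Minimal trailing-window z-score anomaly detector; an illustrative
# stand-in for the statistical models an alerting system might train.
from statistics import mean, stdev

def anomalies(series, window=5, threshold=3.0):
    """Return indices of points deviating > threshold sigmas from the
    mean of the preceding `window` points."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical per-minute VoIP call counts with one obvious spike.
calls_per_minute = [100, 102, 99, 101, 100, 98, 500, 101, 100]
print(anomalies(calls_per_minute))  # → [6]
```

A production system like the one described would add seasonality handling and model persistence, but the core idea of comparing live points against a learned baseline is the same.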

Speakers
avatar for Julien Herzen

Julien Herzen

data scientist, Swisscom
Julien is a data scientist at Swisscom. His experience lies in the areas of machine learning and network algorithms, and his current work includes building analytics and monitoring platforms using big data technologies such as Apache Spark, Druid and Apache Cassandra. He has a PhD from EPFL and talked at various venues such as IEEE Infocom, IEEE ICNP, ACM Mobicom and the Swiss machine learning days.


Tuesday November 15, 2016 16:30 - 17:20
Giralda V

16:30

The Myth of the Big Data Silver Bullet - Why Requirements Still Matter - Nick Burch, Quanticate
We've all heard the hype - Big Data will solve all your storage, processing and analytic problems effortlessly! As Big Data moves along the adoption cycle, there's a wider range of possible technologies and platforms you could use, but sadly picking the right one still remains crucial to success. Some who move beyond the buzzwords to deploy Big Data find things really do work well, but others rapidly run into issues. The difference usually isn't the technologies or the vendors per se, but their appropriateness to the requirements, which aren't always clear up-front...

This session won't tell you what Big Data solution you need. Instead, we'll cover some of the pitfalls, and help you ask the questions that will let you work out your requirements in time for your Big Data system to be a success!

Speakers
NB

Nick Burch

CTO, Quanticate
Nick began contributing to Apache projects in 2003, and hasn't looked back since! He's mostly involved in "Content" projects like Apache POI, Apache Tika and Apache Chemistry, as well as foundation-wide activities like Conferences and Travel Assistance. Nick is CTO at Quanticate, a Clinical Research Organisation (CRO) with a strong focus on data and statistics. Nick has spoken at most ApacheCons since 2007, as well as many...


Tuesday November 15, 2016 16:30 - 17:20
Nervion/Arenal II/III

16:30

Avro: Travel Across (r)evolution - Arek Osinski & Darek Eliasz, Allegro Group
These days, we generate enormous amounts of data. The biggest challenge lies in transforming raw data into knowledge. We would like to take you on a short journey and show our approach for moving from the unstructured world of microservices to a world with Avro schemas inside our data pipelines.

Avro is a well-known format for storing and processing information of any kind online. What are the key features of this format? What are the common problems? Where are the pitfalls? How does this influence our Big Data ecosystem?

The whole story will be illustrated with examples from a real-life implementation.
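To make the schema discussion concrete, here is a hedged sketch (the record and field names are hypothetical, not Allegro's) of an Avro schema and one backward-compatible evolution step. Avro schemas are plain JSON, and adding a field with a default is the canonical safe change, since records written with the old schema remain readable under the new one.

```python
import json

# Hypothetical v1 schema for an event record.
schema_v1 = {
    "type": "record",
    "name": "PageView",
    "namespace": "com.example.events",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "url", "type": "string"},
    ],
}

# v2 adds an optional field. Per the Avro spec, the default for a
# ["null", "string"] union must match the first branch, hence null.
schema_v2 = dict(schema_v1)
schema_v2["fields"] = schema_v1["fields"] + [
    {"name": "referrer", "type": ["null", "string"], "default": None},
]

print(json.dumps(schema_v2, indent=2))
```

Evolution mistakes (renaming fields, removing defaults, reordering union branches) are exactly the kind of pitfall the talk alludes to.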

Speakers
avatar for Dariusz Eliasz

Dariusz Eliasz

Senior Data Platform Engineer, Allegro
Mainly interested in big data platform architecture and data governance. Enthusiast of scalable distributed solutions, processing large amounts of data, and continuous improvement.
AO

Arek Osinski

Senior Data Platform Engineer, Allegro
Arek works in Allegro Group as a senior data engineer. From the beginning he has been involved in building and maintaining the Hadoop infrastructure within Allegro Group. Previously he was responsible for maintaining large-scale database systems. Passionate about new technologies and cycling.


Tuesday November 15, 2016 16:30 - 17:20
Carmona

16:30

Multi-Tenant Machine Learning with Apache Aurora and Apache Mesos - Stephan Erb, Blue Yonder GmbH
Data scientists care about statistics and fast iteration cycles for their experiments. They should not be concerned with technicalities like hardware failures, tenant isolation, or low cluster utilization. In order to shield its data scientists from these matters, Blue Yonder is using Apache Aurora.



When adopting Aurora, our goal was to run multiple machine learning projects on the same physical cluster. This talk will go into the details of this adoption process and highlight key engineering decisions we made. Particular focus will be on the multi-tenancy and oversubscription features of Apache Aurora and Apache Mesos, its underlying resource manager.



Audience members will learn about the fundamentals of both Apache projects and how those can be assembled into a capable machine learning platform.

Speakers
avatar for Stephan Erb

Stephan Erb

Software Engineer, Blue Yonder GmbH
Stephan Erb is a software engineer driven by the goal to make Blue Yonder's data scientists more productive. Stephan holds a master's degree in computer science from the Karlsruhe Institute of Technology (KIT). He is a PMC member of the Apache Aurora project and tweets at @ErbStephan.


Tuesday November 15, 2016 16:30 - 17:20
Santa Cruz

16:30

Ranking the Web with Spark - Sylvain Zimmer, Common Search
Common Search is building an open source search engine based on Common Crawl's monthly dumps of several billion webpages. Ranking every URL on the Web in a transparent and reproducible way is core to the project.



In this presentation, Sylvain Zimmer will explain why Spark is a great match for the job, how the current ranking pipeline works and what challenges it faces to grow in scale and complexity, in order to improve the quality of search results.



Specifically, we will dive into the new Spark 2.0 features that made it practical to compute PageRank from Python on every URL found in Common Crawl, and show how anyone can reproduce and tweak the results on their own cloud servers.
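The abstract does not include Common Search's actual pipeline code; as a self-contained illustration of the algorithm being computed, here is the classic iterative PageRank formulation on a toy link graph. In Spark, this same loop runs over a distributed collection of (url, outlinks) pairs rather than a local dict.

```python
def pagerank(links, iterations=20, damping=0.85):
    """links: dict mapping each page to the list of pages it links to.
    Iteratively redistributes each page's rank across its outlinks."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for page, outlinks in links.items():
            for target in outlinks:
                contribs[target] += ranks[page] / len(outlinks)
        ranks = {page: (1 - damping) + damping * c
                 for page, c in contribs.items()}
    return ranks

# Toy graph: "c" is linked from both "a" and "b", so it ends up
# with the highest rank.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

This sketch omits the dangling-node and normalization refinements a web-scale ranking pipeline needs, but the iterate-and-redistribute structure is what makes the computation such a natural fit for Spark.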

Speakers
SZ

Sylvain Zimmer

Founder, Common Search
Sylvain Zimmer is a software developer and longtime free culture advocate. In 2004 he founded Jamendo, the largest Creative Commons music community online. Since 2012, he has been the CTO of Pricing Assistant, a startup specialized in large-scale crawling of E-commerce websites. He is also the founder and main curator of dotConferences, a series of TED-like developer events in Paris. More recently, he started Common Search, an ambitious...


Tuesday November 15, 2016 16:30 - 17:20
Giralda III/IV

16:30

Writing Apache Spark and Apache Flink Applications Using Apache Bahir - Luciano Resende, IBM
Big Data is all about being able to access and process data in various formats and from various sources. Apache Bahir provides extensions to distributed analytics platforms, giving them access to different data sources. In this talk we will introduce you to Apache Bahir and the various connectors that are available for Apache Spark and Apache Flink. We will also go over the details of how to build, test and deploy a Spark application using the MQTT data source for the new Apache Spark 2.0 Structured Streaming functionality.

Speakers
avatar for Luciano Resende

Luciano Resende

Architect, Spark Technology Center, IBM
Luciano Resende is an Architect in IBM Analytics. He has been contributing to open source at the ASF for over 10 years; he is a member of the ASF and is currently contributing to various big data related Apache projects, including Spark, Zeppelin and Bahir. Luciano is the project chair for Apache Bahir, and also spends time mentoring newly created Apache Incubator projects. At IBM, he contributed to several IBM big data offerings, including BigInsights...


Tuesday November 15, 2016 16:30 - 17:20
Giralda I/II

17:20

 
Wednesday, November 16
 

07:00

Morning Run
Come meet in the lobby of the Melia Sevilla at 7:00am for a morning run. The plan is to cross to the park and run next to the river. 

This will last an hour and the group will be back by 8:00am.

Wednesday November 16, 2016 07:00 - 08:00
Melia Sevilla Hotel Lobby

08:30

Breakfast
Wednesday November 16, 2016 08:30 - 09:30
Giralda Foyer

08:30

Sponsor Showcase
Wednesday November 16, 2016 08:30 - 12:00
Triana Foyer

08:30

Registration
Wednesday November 16, 2016 08:30 - 13:00
Triana Foyer

09:30

Keynote: Introduction to Tensorflow: Tips and Tricks for Neural Net Design - Gema Parreño, AI Developer
TensorFlow has been part of the core of Google's search engine and has been available as an open source tool since last November. The keynote will introduce the architecture of the library with a focus on machine vision, and will dive into the data modeling of a 2016 NASA Space Apps Challenge global finalist project.

Speakers
GP

Gema Parreño

AI Developer
Gema Parreño is a several-times-awarded product designer who has focused on Artificial Intelligence and software architecture for 2 years, with highlights in Natural Language Understanding. Since Google open-sourced TensorFlow, she has been developing recursive neural networks and clustering classifications in TensorFlow with Python, now focusing on sentiment analysis.


Wednesday November 16, 2016 09:30 - 09:50
Giralda I/II

09:55

Keynote: Lessons from the Trenches: How Apache Hadoop is Being Used & The Challenges Its Users Face - John Mertic, Director, ODPi and Open Mainframe Project, Linux Foundation
Apache Hadoop has earned the support of a large & diverse community, with significant interest from businesses, governments, academia & technology vendors – each varying in their goals & objectives for benefiting from the technology. While the distributed data platform’s ecosystem continues to grow, there remains some debate about its ease of adoption & how a wide-range of users can gain business value from it. This session from John Mertic, Director of Program Management for ODPi, will cover how solution providers, app vendors & end users are deploying Apache Hadoop, the daily challenges they face in their environments, how they’d like to use the technology moving forward & much more. Citing insights from ODPi members Capgemini, Linaro & GE, Mertic will break down what he’s learned to demystify the most common Apache Hadoop complexities & barriers to further enterprise adoption.

Speakers
avatar for John Mertic

John Mertic

Director - ODPi and Open Mainframe Project, Linux Foundation
John Mertic is Director of Program Management for ODPi and Open Mainframe Project at The Linux Foundation. Previously, Mertic was director of business development software alliances at Bitnami. Mertic comes from a PHP and Open Source background, being a developer, evangelist, and partnership leader at SugarCRM, board member at OW2, president of OpenSocial, and frequent conference speaker around the world. As an avid writer, Mertic has published... Read More →


Wednesday November 16, 2016 09:55 - 10:10
Giralda I/II

10:00

BarCampApache
Join us for an ‘unconference’ with no set schedule, facilitated by those involved in various Apache projects. More details and registration information can be found here:
https://wiki.apache.org/apachecon/BarCampApacheSeville

Wednesday November 16, 2016 10:00 - 16:20
Estepa

10:15

Coffee Break
Wednesday November 16, 2016 10:15 - 11:00
Giralda Foyer

11:00

On-Premise, UI-Driven Hadoop/Spark/Flink/Kafka/Zeppelin-as-a-service - Jim Dowling, KTH Royal Institute of Technology
Since April 2016, SICS Swedish ICT has provided Hadoop/Spark/Flink/Kafka/Zeppelin-as-a-service to researchers in Sweden. We have developed a UI-driven multi-tenant platform (Apache v2 licensed) in which researchers securely develop and run their applications. Applications can either be deployed as jobs (batch or streaming) or written and run directly from notebooks in Apache Zeppelin. All applications are run on YARN within a security framework built on project-based multi-tenancy. A project is simply a grouping of users and datasets. Datasets are first-class entities that can be securely shared between projects. Our platform also introduces a necessary condition for elasticity: pricing. Application execution time in YARN is metered and charged to projects, which also have HDFS quotas for disk usage. We also support project-specific Kafka topics that can likewise be securely shared.

Speakers
JD

Jim Dowling

Associate Prof, KTH - Royal Institute of Tech
Jim Dowling is an Associate Professor at the School of Information and Communications Technology in the Department of Software and Computer Systems at KTH Royal Institute of Technology as well as a Senior Researcher at SICS – Swedish ICT. He received his Ph.D. in Distributed Systems from Trinity College Dublin (2005) and worked at MySQL AB (2005-2007). He is a distributed systems researcher and his research interests are in the area of... Read More →


Wednesday November 16, 2016 11:00 - 11:50
Giralda III/IV

11:00

Why is My Hadoop Cluster Slow? - Steve Loughran, Hortonworks
Apache Hadoop is used to run jobs that execute tasks over multiple machines with complex dependencies between tasks. At scale, there can be 10s to 1000s of tasks running over 100s to 1000s of machines, which increases the challenge of making sense of their performance. Pipelines of such jobs that logically run a business workflow add another level of complexity. No wonder the question of why Hadoop jobs run slower than expected remains a perennial source of grief for developers. In this talk, we will draw on our experience in debugging and analyzing Hadoop jobs to describe some methodical approaches to this, and present current and new tracing and tooling ideas that can help semi-automate parts of this difficult problem.

Speakers
avatar for Steve Loughran

Steve Loughran

Member of Technical Staff, Hortonworks
Steve Loughran is a developer at Hortonworks, where he works on leading-edge Hadoop applications, most recently on Apache Slider and on Apache Spark's integration with Hadoop and YARN, and Hadoop's S3A connector to Amazon S3. He's the author of Ant in Action, a member of the Apache Software Foundation, and a committer on the Hadoop core since 2009. He lives and works in Bristol, England.


Wednesday November 16, 2016 11:00 - 11:50
Carmona

11:00

Introduction to Apache Beam - Jean-Baptiste Onofré, Apache Software Foundation & Dan Halperin, Google
Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines. The same Beam pipelines work in batch and streaming, and on a variety of open source and private cloud big data processing backends including Apache Flink, Apache Spark, and Google Cloud Dataflow. This talk will introduce Apache Beam's programming model and mechanisms for efficient execution. The speakers will show how to build Beam pipelines, and demo how to use it to execute the same code across different runners.
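The core idea behind Beam's portability can be sketched in a few lines of plain Python: the pipeline is just a data structure describing transforms, and a runner decides how to execute it. This is a toy illustration, not the actual Apache Beam API; `Pipeline` and `DirectRunner` here are invented stand-ins.

```python
# Toy model of a unified pipeline: transforms are recorded, not executed,
# until a runner interprets them.
class Pipeline:
    def __init__(self):
        self.transforms = []          # ordered list of (name, fn) pairs

    def apply(self, name, fn):
        self.transforms.append((name, fn))
        return self

class DirectRunner:
    """Executes the pipeline in-process; each fn maps one element to
    zero or more output elements."""
    def run(self, pipeline, inputs):
        data = inputs
        for name, fn in pipeline.transforms:
            data = [out for elem in data for out in fn(elem)]
        return data

# The same pipeline definition could instead be handed to a runner that
# translates each transform to Flink, Spark, or Dataflow primitives.
p = (Pipeline()
     .apply("split", lambda line: line.split())
     .apply("pair", lambda word: [(word, 1)]))

print(DirectRunner().run(p, ["hello beam hello"]))
# [('hello', 1), ('beam', 1), ('hello', 1)]
```

The real Beam model adds windowing, triggers, and state, but the separation between pipeline definition and runner shown here is the essence of "the same code across different runners".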


Speakers
DH

Dan Halperin

Google
Dan Halperin is a PPMC member and committer on Apache Beam (incubating). He has worked on Beam and Google Cloud Dataflow for 18 months. Prior to that, he was the Director of Research for Scalable Data Analytics at the University of Washington eScience Institute, where he worked on scientific big data problems in oceanography, astronomy, medical informatics, and the life sciences. Dan received his Ph.D. in Computer Science and Engineering at the... Read More →
JO

Jean-Baptiste Onofré

Apache Software Foundation
JB is Apache Beam's champion and a member of the Beam PPMC. He is a long-tenured Apache Member, serving as PMC member/committer on 20 projects that range from integration to big data.


Wednesday November 16, 2016 11:00 - 11:50
Nervion/Arenal II/III

11:00

Big Data Machine Learning with Apache PredictionIO - Simon Chan, Salesforce
Apache PredictionIO (incubating) provides a full stack machine learning environment on top of Apache Spark, making it easy for developers to iterate on production-deployable machine learning engines. Apache PredictionIO is designed for data scientists and developers to build predictive web services for real-world applications in a fraction of the time normally required.

In this talk, the speaker will introduce the latest developments of PredictionIO, and show how to use it to build and deploy predictive engines in real production environments. Using PredictionIO's DASE design pattern, Simon will illustrate how developers can build machine learning applications with the separation of concerns (SoC) in mind. The speaker will also go over the future roadmap of Apache PredictionIO and some of its recent development.


Speakers
SC

Simon Chan

Senior Director, Salesforce
Simon Chan is an open-source product innovator with 13 years of tech management experience in various countries. He is the founder and former CEO of the company that created PredictionIO - currently ranked on GitHub as the most popular Spark-based machine learning OSS project in the world. The PredictionIO company was acquired by Salesforce in 2016 and the open-source product has been accepted by the ASF as Apache PredictionIO (incubating). Simon is... Read More →


Wednesday November 16, 2016 11:00 - 11:50
Giralda V

11:00

Machine Learning on Apache Apex with Apache Samoa - Bhupesh Chawda, DataTorrent Software
This talk will be about the integration of Apache Samoa, a distributed streaming machine learning framework (https://samoa.incubator.apache.org), with Apache Apex, a distributed, scalable and fault-tolerant stream processing engine (https://apex.apache.org). Apache Samoa is a kind of WORA (write-once-run-anywhere) framework, where algorithms developed on Samoa can be run on other distributed stream processing engines like Storm, Samza and Flink. This talk will introduce the integration story with Apache Apex and outline the process and the challenges therein. In addition, the talk will dwell on a comparative analysis of the performance of Samoa algorithms on a few popular integrated runners, namely Apache Storm, Apache Flink and Apache Apex.

Speakers
avatar for Bhupesh Chawda

Bhupesh Chawda

Software Engineer, DataTorrent Software India Pvt. Ltd.
Bhupesh Chawda is a Software Engineer at DataTorrent Software, India and a committer on the Apache Apex project under the Apache Software Foundation. Previously he was a Research Engineer at IBM India Research Labs, New Delhi. His interests are in the areas of distributed systems, stream processing and machine learning. He has experience delivering talks at international conferences like EDBT (2013) and ACM IKDD CODS (2016). He has publications... Read More →


Wednesday November 16, 2016 11:00 - 11:50
Santa Cruz

11:00

Attacking a Big Data Developer - Olaf Flebbe, science+computing ag
Developers are a possible attack vector for targeted attacks to infiltrate malicious code

into enterprises.



The Speaker did a network traffic analysis with the Bro Network Security Monitor (bro.org)

backed by an ELK Stack while compiling Apache Bigtop, a Big Data Distribution containing

Apache Hadoop, Spark, HBase, Hive, Flink et al.



While there are no obvious traces of a malicious code within the traffic, there are many

findings of possible attack vectors like unsecurely configured critical software infrastructure

servers, usage of private repositories or unsecure protocols.



The Analysis showed that many compile jobs are downloading and running executables from untrusted sources.

The author will shortly explain how these weaknesses can be exploited and will give recommendations on how to resolve these issues.

Speakers
OF

Olaf Flebbe

Chief Software Architect
Dr. Olaf Flebbe received his PhD in computational physics in Tübingen, Germany. He works as the chief software architect at science+computing ag. He is a member of the PMC of Apache Bigtop. Occasionally he gives talks about random projects at various conferences.


Wednesday November 16, 2016 11:00 - 11:50
Giralda I/II

11:00

Shared Memory Layer and Faster SQL for Spark Applications - Dmitriy Setrakyan, GridGain
In this presentation we will talk about the need to share state in memory across different Spark jobs or applications, and Apache Ignite as the technology that makes it possible. We will dive into the importance of in-memory file systems and shared in-memory RDDs with Apache Ignite, as well as the need to index data in memory for fast SQL execution. We will also present a hands-on demo showing the advantages and disadvantages of one approach over another, and discuss the requirements of storing data off-heap in order to achieve large horizontal and vertical scale of applications using Spark and Ignite.

Speakers
DS

Dmitriy Setrakyan

EVP Engineering, GridGain
Dmitriy Setrakyan is founder and Chief Product Officer at GridGain. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior to GridGain, Dmitriy worked at eBay where he was responsible for the architecture of an ad-serving system processing several billion hits a day. Currently Dmitriy... Read More →


Wednesday November 16, 2016 11:00 - 11:50
Nervion/Arenal I

11:00

SQL and Streaming Systems - Atri Sharma, Microsoft
The talk will focus on how to design and build systems for stream-based data, exploiting the power of SQL and relational algebra on streaming data using Apache Apex and Apache Calcite.

Speakers
avatar for Atri Sharma

Atri Sharma

Software Engineer, Azure Data Lake, Microsoft
An Apache Apex committer, where he is engaged in designing and implementing next-generation features and performing reviews. A learning PostgreSQL hacker who is currently engaged in various aspects of Postgres. He has been an active contributor, implementing ordered set functions and grouping sets in PostgreSQL, and improving sort, hash join and OLAP performance. He is also a committer for Apache HAWQ, Apache MADlib and has been... Read More →


Wednesday November 16, 2016 11:00 - 11:50
Giralda VI/VII

12:00

Get in Control of Your Workflows with Apache Airflow - Christian Trebing, Blue Yonder
Whenever you work with data, sooner or later you stumble across the definition of your workflows. At what point should you process your customer's data? What subsequent steps are necessary? And what went wrong with your data processing last Saturday night?



At Blue Yonder we use Apache Airflow to solve these problems. It can be extended with new functionality by developing plugins in Python. With Airflow, we define workflows as directed acyclic graphs and get a shiny UI for free. Airflow comes with some task operators which can be used out of the box to complete certain tasks. For more specific cases, you can also develop new operators in your plugin.



This talk will explain the concepts behind Airflow, demonstrating how to define your own workflows and how to extend the functionality. You'll also get to hear about our experiences using this tool in real-world scenarios.
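The "workflows as directed acyclic graphs" idea above can be illustrated with the Python standard library alone: represent each task's dependencies and let a topological sort produce a valid run order. This is a sketch of the concept, not Airflow's own scheduler or API; the task names are invented.

```python
from graphlib import TopologicalSorter

# Hypothetical nightly workflow: each task maps to the set of tasks
# that must complete before it may run.
dag = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"validate"},
    "load": {"transform"},
    "report": {"load"},
}

# A topological order is any task ordering that respects dependencies;
# a scheduler like Airflow's executes tasks in such an order, running
# independent tasks in parallel and retrying failures.
order = list(TopologicalSorter(dag).static_order())
print(order)
# ['extract', 'validate', 'transform', 'load', 'report']
```

In real Airflow the same structure is declared with operator objects and `set_upstream`/`>>` relations, plus scheduling intervals and retry policies on top.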

Speakers
CT

Christian Trebing

Senior Software Engineer
Christian is a Software Developer from Karlsruhe, Germany. He has studied Computer Science at TU Darmstadt. Currently he is working on big data applications at Blue Yonder, enjoying the challenges at the intersection between software engineering and data science.


Wednesday November 16, 2016 12:00 - 12:50
Carmona

12:00

Mining and Identifying Security Threat Using Spark SQL, HBase and Solr - Manidipa Mitra, ValueLabs
This presentation will discuss how to design a highly effective, scalable and performant distributed system to find identity theft and fraud by mining billions of shareholding records for a leading financial organization. It will also cover how terabytes of data can be migrated from Oracle to Hadoop, stored in Parquet format, processed in a distributed computing framework with Spark DataFrames, and pushed to different service layers (HBase, Impala, Solr, HDFS) depending on the query/access pattern. The design also shows how frequent transactions were handled and data pre-processed at the end of the day to meet a seconds-level response-time SLA, creating thousands of reports by mining millions of records in minutes.

Speakers
avatar for Manidipa Mitra

Manidipa Mitra

Director, ValueLabs
Manidipa Mitra heads the Big Data CoE at ValueLabs and has extensive experience in building industry-specific solutions using distributed computing and cloud technologies. She has 16+ years of software industry experience and in-depth knowledge of disruptive technologies, cloud and storage, and holds dual graduate degrees in Physics and Computer Science. Manidipa previously spoke at the Grace Hopper Conference 2013, and presently... Read More →


Wednesday November 16, 2016 12:00 - 12:50
Giralda III/IV

12:00

Smart Storage Management: Towards Higher HDFS Storage Efficiency - Wei Zhou, Intel
Data volumes of all kinds have increased dramatically in recent years, and new storage devices (NVMe SSDs, flash SSDs, etc.) can be utilized to improve data access performance. HDFS provides methodologies like HDFS Cache, Heterogeneous Storage Management (HSM) and Erasure Coding (EC) to support this, but it remains a big challenge to define and adjust different storage strategies for different data in a dynamic environment.

To overcome this challenge and improve the storage efficiency of HDFS, we will introduce a comprehensive solution, Smart Storage Management (SSM), in Apache Hadoop. HDFS operation data and system state information are collected from the cluster; based on the collected metrics, SSM can extract "data access patterns" and use them to automatically make sophisticated use of these methodologies to optimize HDFS storage efficiency.
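The pattern-driven part of such a system can be sketched very simply: count how often each file is accessed and map hot files to faster storage. The threshold, tier names, and function below are invented for illustration and are not SSM's actual policy engine.

```python
from collections import Counter

def choose_storage(access_log, hot_threshold=3):
    """Map each file to a storage tier from observed access counts.
    Hypothetical policy: files read >= hot_threshold times go to SSD,
    the rest stay on spinning disk (cf. HDFS HSM storage types)."""
    counts = Counter(access_log)
    return {path: ("SSD" if n >= hot_threshold else "DISK")
            for path, n in counts.items()}

# Simulated access log: file /a is hot, /b is cold.
log = ["/a", "/a", "/a", "/b"]
print(choose_storage(log))
# {'/a': 'SSD', '/b': 'DISK'}
```

A real implementation would also weigh recency, file size, cache capacity, and erasure-coding trade-offs before moving any blocks.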

Speakers
WZ

Wei Zhou

Software engineer at Intel, currently focused mainly on Apache Hadoop performance optimization. Co-presenter of the HBase Developer Course at Strata+Hadoop World Beijing 2016.


Wednesday November 16, 2016 12:00 - 12:50
Giralda V

12:00

Hands On! Deploying Apache Hadoop Spark Cluster with HA, Monitoring, and Logging in AWS - Andrew Mcleod & Peter Vander Giessen, Canonical
This is a hands-on, workshop-style session where attendees will learn how to deploy complex workloads such as a 10-node Hadoop Spark cluster complete with HA, logging, and monitoring, and then scale the cluster as needs dictate. Attendees will also learn how to deploy other workloads, such as connecting Apache Kafka or Apache Zeppelin into the solution, or trying the latest cloud-native Kubernetes. We will then run a sample TeraSort, Spark job, and PageRank benchmark to get familiar with the cluster. An AWS controller will be provided for folks who don't have cloud access.
No prior knowledge is needed, but if you want to get a head start, install the Juju client by following the docs at http://jujucharms.com/get-started


Wednesday November 16, 2016 12:00 - 12:50
Giralda I/II

12:00

Apache Kudu: A Distributed, Columnar Data Store for Fast Analytics - Mike Percy, Cloudera
The Hadoop ecosystem has recently made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems like Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems like Apache HBase, applications can achieve millisecond-scale random access to arbitrarily-sized datasets. However, gaps remain when scans and random access are both required.



This talk will investigate the trade-offs between real-time random access and fast analytic performance from the perspective of storage engine internals. It will also describe Apache Kudu, the new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark, that fills the gap described above to provide a new option to achieve fast scans and fast random access from a single API.

Speakers
avatar for Mike Percy

Mike Percy

Software Engineer, Cloudera
Mike Percy is a software engineer at Cloudera and a PMC member on Apache Kudu, an open source distributed column store for the Hadoop ecosystem. He is also a PMC member on Apache Flume. Prior to joining Cloudera, Mike worked at Yahoo! building machine learning infrastructure for Big Data. Mike holds a BSCS from UC Santa Cruz and an MSCS from Stanford.



Wednesday November 16, 2016 12:00 - 12:50
Nervion/Arenal I

12:00

How Software Engineering Has Changed with Advent of OSS - Nupur Sharma, Ingenium Data Systems
The talk will explore the business of open source and how open source has changed the way software engineering is done and executed. Earlier, software was designed with commercial, non-extensible products in mind. With the new open source paradigm, companies are now driving software development with open source products at the core and leveraging the extensibility of the products themselves. In the talk, Nupur will walk through the thought process of product designers through the 1990s, 2000s and now, and explain how organisations are adopting open source software and building their entire business models around it. Driving through some use cases, the transition from closed source to open source in many existing, well-established processes will be discussed and explored. This should enlighten any organisation looking to move to the OSS paradigm.

Speakers
avatar for Nupur Sharma

Nupur Sharma

Director, Ingenium Data Systems


Wednesday November 16, 2016 12:00 - 12:50
Giralda VI/VII

12:00

Performance Tuning Tips for Apache Spark Machine Learning Workloads - Shreeharsha GN & Amir Sanjar, IBM
The OpenPOWER 8 architecture, the latest offering of IBM SoftLayer, is the perfect platform for evaluating and optimizing Apache Spark solutions. In under 60 minutes from receiving a SoftLayer welcome package for your new bare-metal POWER8 server, you can have Hadoop and Spark, along with many other software applications, installed, configured, optimized, and ready to run Spark ML workloads. In this talk we will cover: 1) Apache Spark overview 2) Apache Spark software deployment 3) Spark optimization on highly threaded servers 4) Demo

Speakers
avatar for Shreeharsha GN

Shreeharsha GN

Senior software engineer, IBM
Shreeharsha GN has many years of experience in the field of performance engineering for software applications, Java stack optimization and big data software at companies including IBM, Azul Systems, HCL and Infosys. He is a SPEC member and was a key person responsible for certifying the energy efficiency of IBM enterprise servers. At present, he is working on Apache Spark OpenPOWER enablement and Apache... Read More →


Wednesday November 16, 2016 12:00 - 12:50
Santa Cruz

12:00

Apache CouchDB 2.0 Sync Deep Dive - Jan Lehnardt, Neighbourhoodie Software
This talk takes a deep dive below the magic and explains how to build robust sync systems, whether you want to use CouchDB or build your own.

The talk will go through the components of a successful data sync system and the trade-offs you can make to solve your particular problems.

Reliable data sync, from Big Data to Mobile.
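One building block of any such sync system is an incremental changes feed: the target remembers the last source sequence number it has seen (a checkpoint) and fetches only newer changes. The sketch below illustrates that idea in plain Python; it is a simplified model, not CouchDB's actual replication protocol, and ignores conflicts and revision trees.

```python
def changes_since(db, since):
    """Return (seq, doc_id) pairs newer than the given checkpoint."""
    return [(seq, doc_id) for seq, doc_id in db["changes"] if seq > since]

def replicate(source, target):
    """One incremental replication pass; the checkpoint ensures already
    transferred changes are never re-read."""
    since = target["checkpoints"].get(source["name"], 0)
    for seq, doc_id in changes_since(source, since):
        target["docs"][doc_id] = source["docs"][doc_id]
        target["checkpoints"][source["name"]] = seq

source = {"name": "laptop",
          "docs": {"d1": {"v": 1}, "d2": {"v": 2}},
          "changes": [(1, "d1"), (2, "d2")]}
target = {"name": "server", "docs": {}, "changes": [], "checkpoints": {}}

replicate(source, target)
print(target["docs"])         # both documents copied
print(target["checkpoints"])  # {'laptop': 2}
```

Running `replicate` again is a no-op because the checkpoint already points at sequence 2; that idempotence is what makes interrupted syncs safe to resume.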

Speakers
avatar for Jan Lehnardt

Jan Lehnardt

CEO, Neighbourhoodie Software
Jan Lehnardt is the PMC Chair and VP of Apache CouchDB, co-creator of the Hoodie web app framework based on CouchDB as well as the founder and CEO of Neighbourhoodie Software. He’s the longest standing contributor to Apache CouchDB.


Wednesday November 16, 2016 12:00 - 13:00
Nervion/Arenal II/III

13:00

Highly Scalable Big Data Analytics with Apache Drill - Tom Barber, Meteorite Consulting
Big Data analytics is becoming more and more popular as the query response times improve. We'll look at building and deploying a fully operational and highly scalable Apache Bigtop based Big Data Analytics platform with no code.

In this talk we'll utilise the power of the open source Juju application modelling platform to deploy our software and configure it for us. We'll also discuss deployment options, scalability and resiliency, allowing users to get the most from their data.

Speakers
avatar for Tom Barber

Tom Barber

Technical Director, Meteorite Consulting
Tom Barber is the director of Meteorite BI and Spicule BI. A member of the Apache Software Foundation and regular speaker at ApacheCon, Tom has a passion for simplifying technology. The creator of Saiku Analytics and open source stalwart, when not working for NASA, Tom currently deals with Devops and data processing systems for customers and clients, both in the UK, Europe and also North America.


Wednesday November 16, 2016 13:00 - 13:50
Carmona

13:00

Distributed Logistic Model Trees - Mateo Alvarez & Antonio Soriano, Stratio Big Data
Classification algorithms play an important role in different business areas, such as fraud detection, cross-selling or customer behavior. In the business context, interpretability is a very desirable property, sometimes even a hard requirement. However, interpretable algorithms are usually outperformed by other non-interpretable algorithms such as Random Forest. In this talk Antonio Soriano will present a distributed implementation in Spark of the Logistic Model Tree (LMT) algorithm (Landwehr et al. (2005), Machine Learning, 59(1-2), 161-205), which consists of a decision tree with logistic classifiers in the leaves. While being highly interpretable, the LMT consistently performs equal to or better than other popular algorithms in several performance metrics such as accuracy, precision/recall or area under the ROC curve.

Speakers
MA

Mateo Alvarez

Big Data developer/ Data Scientist, Stratio
Mateo Álvarez studied aerospace engineering at the Universidad Politécnica de Madrid, with a masters degree in Propulsion Systems, and Data Science in the Universidad Rey Juan Carlos. He is passionate about data analysis with Scala, Python and all Big Data technologies, and is currently part of the Data Science team at Stratio Big Data, working with ML algorithms, profiling analysis based around Spark. | | | | We at Stratio have been... Read More →


Wednesday November 16, 2016 13:00 - 13:50
Giralda VI/VII

13:00

Real Time Aggregation with Kafka, Spark Streaming and ElasticSearch, Scalable Beyond a Million RPS - Dibyendu Bhattacharya, InstartLogic
While building a massively scalable real-time pipeline to collect transaction logs from network traffic, one of the major challenges was performing aggregation on streaming data on the fly. This was needed to compute multiple metrics across various dimensions, helping our customers see near real-time views of application delivery and performance. In this talk, learn how we designed our real-time pipeline for multi-stage aggregation powered by Kafka, Spark Streaming and ElasticSearch. At InstartLogic we use a custom Spark receiver for Kafka in the first aggregation stage. The second stage is Spark Streaming-driven aggregation within a given batch window. The final stage involves custom ElasticSearch plugins to aggregate across batches. I will cover this multi-stage aggregation, including optimisation across all stages, which scales beyond a million RPS.
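The batch-window stage of such a pipeline can be sketched in plain Python: assign each event to a fixed time window and pre-aggregate per (window, dimension) key before any cross-batch merge. This is a toy model of the concept, not the speaker's actual Spark Streaming code; the event shape and window size are invented.

```python
from collections import defaultdict

def window_aggregate(events, window_secs=10):
    """Sum a metric per (window_start, dimension) key.
    events are (timestamp, dimension, value) tuples; each event falls
    into the fixed window containing its timestamp."""
    agg = defaultdict(int)
    for ts, dim, value in events:
        window = ts - ts % window_secs   # start of the enclosing window
        agg[(window, dim)] += value
    return dict(agg)

events = [(1, "api", 5), (3, "api", 7), (12, "api", 1), (4, "cdn", 2)]
print(window_aggregate(events))
# {(0, 'api'): 12, (10, 'api'): 1, (0, 'cdn'): 2}
```

Because the output keys are deterministic, partial aggregates from many workers (or many batches) can later be merged by simply summing values for equal keys, which is what the final ElasticSearch-side stage described above relies on.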

Speakers
avatar for Dibyendu Bhattacharya

Dibyendu Bhattacharya

Data Platform Engineer, InstartLogic
Dibyendu holds an MS in Software Systems and a B.Tech in Computer Science, with experience building applications and products leveraging distributed computing and big data technologies. He is presently working as a Data Platform Engineer at InstartLogic, the world's first endpoint-aware application delivery solution for making web and mobile applications fast. Dibyendu has extensive experience building scalable data platforms, specialised in streaming... Read More →



Wednesday November 16, 2016 13:00 - 13:50
Giralda III/IV

13:00

Scio, a Scala DSL for Apache Beam - Robert Gruener, Spotify
Learn about Scio, a Scala DSL for Apache Beam. Beam introduces a simple, unified programming model for both batch and streaming data processing, while Scio brings it much closer to the high-level API many data engineers are familiar with. We will cover the design and implementation of the framework, including features like type-safe BigQuery and a REPL. There will also be a live coding demo.

Speakers
RG

Robert Gruener

Software Engineer, Spotify
I have been at Spotify for 3 years, working on popular music recommendation features such as Discover Weekly and Release Radar. At Spotify I have been a heavy user of Scalding, Cassandra, and now Scio in order to make sense of our huge amount of data and find the perfect song to present to all 100M of our users.


Wednesday November 16, 2016 13:00 - 13:50
Nervion/Arenal II/III

13:00

Parquet Format in Practice & Detail - Uwe L. Korn, Blue Yonder
Apache Parquet is among the most commonly used column-oriented data formats in the big data processing space. It leverages various techniques to store data in a CPU- and I/O-efficient way. Furthermore, it can push analytical queries down to the I/O layer to avoid loading irrelevant data chunks. With Java and C++ implementations, Parquet is also the perfect choice for exchanging data between different technology stacks.

As part of this talk, a general introduction to the format and its techniques will be given. Their benefits and some of the inner workings will be explained to give a better understanding how Parquet achieves its performance. At the end, benchmarks comparing the new C++ & Python implementation with other formats will be shown.
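The chunk-skipping ("predicate push-down") technique mentioned above can be illustrated with per-chunk min/max statistics, roughly analogous to Parquet's row-group statistics. This is a toy model in plain Python, not the Parquet format or either of its implementations.

```python
def make_chunks(values, size):
    """Split a column into chunks, keeping min/max stats per chunk."""
    return [{"data": values[i:i + size],
             "min": min(values[i:i + size]),
             "max": max(values[i:i + size])}
            for i in range(0, len(values), size)]

def scan_greater_than(chunks, threshold):
    """Answer 'value > threshold', skipping chunks whose stats prove
    no row inside can match. Returns (matches, chunks actually read)."""
    hits, chunks_read = [], 0
    for chunk in chunks:
        if chunk["max"] <= threshold:
            continue                    # push-down: skip the whole chunk
        chunks_read += 1
        hits.extend(v for v in chunk["data"] if v > threshold)
    return hits, chunks_read

chunks = make_chunks([1, 2, 3, 40, 50, 60], size=3)
print(scan_greater_than(chunks, 10))
# ([40, 50, 60], 1)  -- the first chunk was skipped without being read
```

The same principle is why sorting or partitioning data on commonly filtered columns pays off with Parquet: tighter min/max ranges per row group mean more chunks can be skipped at the I/O layer.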

Speakers
avatar for Uwe L. Korn

Uwe L. Korn

Data Scientist, Blue Yonder GmbH
Uwe Korn is a Data Scientist at the German RetailTec company Blue Yonder. His expertise is on building architectures for machine learning services that are scalably usable for multiple customers aiming at high service availability as well as rapid prototyping of solutions to evaluate the feasibility of his design decisions. As part of his work to provide an efficient data interchange he became a core committer to the Apache Parquet project.


Wednesday November 16, 2016 13:00 - 13:50
Nervion/Arenal I

13:00

On the Representation and Reuse of Machine Learning Models - Villu Ruusmann, Openscoring Ltd.
Big Data applications rely on machine learning to derive new value. Model training and deployment are handled by different people in different environments, which makes model transferability a major concern.



This talk inquires into popular R, Scikit-Learn and Apache Spark model types, and connects them at a standardized PMML representation level. PMML adds value to all stages of the workflow, starting from model interpretation, reorganization and persistence, and ending with fully-automated model deployment to schema-full Big Data frameworks.



Attendees will learn that models are not locked-in "black boxes", but easily accessible and programmable components in the application layer. This realization should translate to improved workflows, and smarter and more performant applications.
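To make "accessible and programmable components" concrete, here is a minimal PMML-like fragment for a linear model, built and consumed with the standard library's XML tools. The field name and coefficients are invented, and real PMML requires additional elements and attributes (header, data dictionary, mining schema) omitted here for brevity.

```python
import xml.etree.ElementTree as ET

# Build a tiny PMML-style document for the made-up model y = 0.5*x + 1.0.
pmml = ET.Element("PMML", version="4.3")
model = ET.SubElement(pmml, "RegressionModel", functionName="regression")
table = ET.SubElement(model, "RegressionTable", intercept="1.0")
ET.SubElement(table, "NumericPredictor", name="x", coefficient="0.5")

doc = ET.tostring(pmml, encoding="unicode")
print(doc)

# Any consumer, in any language, can parse the same document and score:
parsed = ET.fromstring(doc)
intercept = float(parsed.find(".//RegressionTable").get("intercept"))
coeff = float(parsed.find(".//NumericPredictor").get("coefficient"))
score = intercept + coeff * 4.0   # evaluate at x = 4.0
print(score)
# 3.0
```

That round trip, serialize in the training environment and evaluate in the deployment environment, is exactly the transferability argument the talk makes.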

Speakers
VR

Villu Ruusmann

CTO, Openscoring OÜ
Villu Ruusmann is the founder and CTO of Openscoring Ltd, a company that provides an open source implementation of the Predictive Model Markup Language (PMML) standard. Villu has extensive knowledge about popular machine learning model training and deployment platforms, which he has turned into a mass of PMML-based integration solutions. Villu has limited public speaking experience from his academic pursuits. The latest talk was held at... Read More →


Wednesday November 16, 2016 13:00 - 13:50
Santa Cruz

13:00

What's With the 1s and 0s? Making Sense of Binary Data at Scale with Tika and Friends - Nick Burch, Quanticate
Large amounts of unknown data seeks helpful tools to identify itself and generate content!

With one or two files, you can take time to manually identify them, and get out their contents. With thousands of files, or the internet's worth, this won't scale, even with mechanical turks! Luckily, there are open source tools and programs out there to help.

First we'll look at how we can work out what a given blob of 1s and 0s actually is, be it textual or binary. We'll then see how to extract common metadata from it, along with text, embedded resources, images, and maybe even the kitchen sink! We'll see how Apache Tika can do all of this for you, along with alternate and additional tools. Finally, we'll look at how to roll this all out on a Big Data scale.
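The first step, working out what a blob is, usually starts with "magic bytes" at the front of the file, which is (in a far richer, spec-driven form) what Tika's detectors do. A toy detector under that assumption, in plain Python rather than Tika itself:

```python
# Well-known magic-byte prefixes (a tiny subset of what Tika knows).
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",   # also docx/xlsx/jar containers
}

def detect(blob):
    """Return a MIME type by matching known magic-byte prefixes,
    falling back to a crude printable-bytes heuristic for text."""
    for prefix, mime in MAGIC.items():
        if blob.startswith(prefix):
            return mime
    if all(32 <= b < 127 or b in (9, 10, 13) for b in blob[:64]):
        return "text/plain"
    return "application/octet-stream"

print(detect(b"%PDF-1.4 ..."))   # application/pdf
print(detect(b"hello world"))    # text/plain
```

Real detection is messier: ZIP-based formats like docx need a look inside the container, and text needs charset detection, which is why a maintained library beats a hand-rolled table at scale.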

Speakers
NB

Nick Burch

CTO, Apache Software Foundation
Nick began contributing to Apache projects in 2003, and hasn't looked back since! He's mostly involved in "Content" projects like Apache POI, Apache Tika and Apache Chemistry, as well as foundation-wide activities like conferences and travel assistance. Nick is CTO at Quanticate, a Clinical Research Organisation (CRO) with a strong focus on data and statistics. Nick has spoken at most ApacheCons since 2007, and as well as many... Read More →


Wednesday November 16, 2016 13:00 - 13:50
Giralda I/II

13:00

Apache Ignite - JCache and Beyond - Dmitriy Setrakyan, GridGain
This presentation will provide a good overview of the Apache Ignite project, including a detailed look at the distributed in-memory Data Grid, Compute Grid, Streaming, in-memory SQL, and many other components provided by Apache Ignite. We will also go into detail on how existing in-memory caching products and data grids can be used to share memory across Apache Spark jobs and applications, and present a hands-on demo showing the performance benefits of querying shared memory using SQL.

Speakers
DS

Dmitriy Setrakyan

EVP Engineering, GridGain
Dmitriy Setrakyan is founder and Chief Product Officer at GridGain. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior to GridGain, Dmitriy worked at eBay where he was responsible for the architecture of an ad-serving system processing several billion hits a day. Currently Dmitriy... Read More →


Wednesday November 16, 2016 13:00 - 13:50
Giralda V