Apache: Big Data Europe 2016
Tuesday, November 15 • 16:30 - 17:20
Ranking the Web with Spark - Sylvain Zimmer, Common Search

Common Search is building an open source search engine based on Common Crawl's monthly dumps of several billion webpages. Ranking every URL on the Web in a transparent and reproducible way is core to the project.

In this presentation, Sylvain Zimmer will explain why Spark is a great match for the job, how the current ranking pipeline works, and what challenges it faces as it grows in scale and complexity to improve the quality of search results.

Specifically, we will dive into the new Spark 2.0 features that made it practical to compute PageRank from Python on every URL found in Common Crawl, and show how anyone can reproduce and tweak the results on their own cloud servers.
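
For readers unfamiliar with PageRank, the iteration at its core can be sketched in a few lines of plain Python. This is an illustration only, not the Common Search pipeline: it runs in memory on a toy link graph rather than on Spark, and the function name `pagerank`, the damping factor of 0.85, the iteration count, and the even redistribution of rank from pages without outgoing links are all assumptions made for this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to a score; scores sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly
        # over all pages (an assumption of this sketch).
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {page: base for page in pages}

        # Each page passes an equal share of its rank to every page it links to.
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks

    return ranks


# Hypothetical four-page graph used only for illustration.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
toy_ranks = pagerank(toy_links)
```

On this toy graph, page "c" receives links from three pages and ends up with the highest score, while "d", which nothing links to, keeps only the baseline share. At the scale of several billion URLs the same update has to be expressed as distributed joins and aggregations, which is where Spark comes in.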


Sylvain Zimmer

Founder, Common Search
Sylvain Zimmer is a software developer and longtime free culture advocate. In 2004 he founded Jamendo, the largest Creative Commons music community online. Since 2012, he has been the CTO of Pricing Assistant, a startup specializing in large-scale crawling of e-commerce websites. He...

Tuesday November 15, 2016 16:30 - 17:20 CET
Giralda III/IV