Apache: Big Data Europe 2016
Wednesday, November 16 • 13:00 - 13:50
What's With the 1s and 0s? Making Sense of Binary Data at Scale with Tika and Friends - Nick Burch, Quanticate

Large amounts of unknown data seek helpful tools to identify themselves and generate content!

With one or two files, you can take the time to identify them manually and extract their contents. With thousands of files, or the internet's worth, this won't scale, even with mechanical turks! Luckily, there are open source tools and programs out there to help.

First we'll look at how we can work out what a given blob of 1s and 0s actually is, be it textual or binary. We'll then see how to extract common metadata from it, along with text, embedded resources, images, and maybe even the kitchen sink! We'll see how Apache Tika can do all of this for you, along with alternative and additional tools. Finally, we'll look at how to roll this all out on a Big Data scale.


Nick Burch

CTO, Quanticate
Nick began contributing to Apache projects in 2003, and hasn't looked back since! He's mostly involved in "Content" projects like Apache POI, Apache Tika and Apache Chemistry, as well as foundation-wide activities like Conferences and Travel Assistance. Nick is CTO at Quanticate, a...

Wednesday November 16, 2016 13:00 - 13:50 CET
Giralda I/II