Answering Multi-dimensional Queries on Massive Datasets

About the Session

In this talk I will share some of the lessons learnt while building a data product in the advertising domain to improve business decisions and leverage data for new revenue streams.

For large datasets with multiple dimensions, summaries computed for each dimension can be quickly combined to obtain an accurate summary of various combinations of the dimensions (union, intersection, etc.). Summarisation can be done in a fully distributed manner by partitioning the data arbitrarily across many nodes, summarising each partition, and combining the results.
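As a rough illustration of the mergeability described above, here is a minimal K-th Minimum Value sketch in Python. It is a simplified sketch under stated assumptions, not the implementation used in the talk or in any library: the names `hash01`, `KMVSketch`, and `intersection_estimate`, the choice of SHA-1 as the hash, and the default `k = 2048` are all hypothetical.

```python
import hashlib


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) (SHA-1 is an assumption)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Keeps only the k smallest distinct hash values seen so far."""

    def __init__(self, k: int = 2048, values=()):
        self.k = k
        self.values = sorted(set(values))[:k]

    @classmethod
    def from_items(cls, items, k: int = 2048):
        # Written for clarity, not memory efficiency: all hashes are
        # materialised before trimming to the k smallest.
        return cls(k, (hash01(x) for x in items))

    def estimate(self) -> float:
        if len(self.values) < self.k:
            # Fewer than k distinct hashes: the sketch holds every one of them.
            return float(len(self.values))
        # Standard KMV estimator: (k - 1) / (k-th smallest hash value).
        return (self.k - 1) / self.values[-1]

    def union(self, other: "KMVSketch") -> "KMVSketch":
        # The k smallest hashes of the combined data are among the
        # k smallest hashes of each part, so merging loses nothing.
        return KMVSketch(min(self.k, other.k), self.values + other.values)


def intersection_estimate(a: KMVSketch, b: KMVSketch) -> float:
    """Estimate |A intersect B| as (estimated Jaccard) * (estimated |A union B|)."""
    u = a.union(b)
    if not u.values:
        return 0.0
    in_a, in_b = set(a.values), set(b.values)
    shared = sum(1 for v in u.values if v in in_a and v in in_b)
    return shared / len(u.values) * u.estimate()


if __name__ == "__main__":
    # Hypothetical data: one dimension value split arbitrarily over 4 nodes.
    users_a = [f"user-{i}" for i in range(60_000)]
    users_b = [f"user-{i}" for i in range(40_000, 100_000)]
    partials = [KMVSketch.from_items(users_a[n::4]) for n in range(4)]
    sketch_a = partials[0]
    for part in partials[1:]:
        sketch_a = sketch_a.union(part)
    sketch_b = KMVSketch.from_items(users_b)
    print("union estimate:", sketch_a.union(sketch_b).estimate())
    print("intersection estimate:", intersection_estimate(sketch_a, sketch_b))
```

Merging the per-partition sketches yields exactly the same sketch as summarising the whole dataset in one place, which is what makes arbitrary partitioning safe. The intersection estimate is noisier than the union estimate, and its relative error grows as the overlap shrinks.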

We will explore some of the approaches and techniques for summarisation, such as HyperLogLog and K-th Minimum Value (KMV). We will explain how we solved this in the context of a real-world problem using Data Sketches, an open source library that provides summaries such as unique counts, quantiles, and frequent items.
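To give a feel for how a HyperLogLog summary counts unique items, here is a minimal, self-contained Python sketch. It is an illustration under assumptions, not the Data Sketches library's API or internals: the class name `HyperLogLog`, the precision `p = 12`, and the use of SHA-1 are hypothetical choices, and the bias constant used is the standard one for 128 or more registers.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog: 2**p registers, each storing a maximum bit rank."""

    def __init__(self, p: int = 12):
        if p < 7:
            raise ValueError("the alpha formula below assumes p >= 7")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)               # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        # Sketches built on different partitions combine by element-wise max.
        if self.p != other.p:
            raise ValueError("precision mismatch")
        out = HyperLogLog(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100_000):
        hll.add(f"user-{i}")
    print("estimated uniques:", hll.estimate())
```

With `p = 12` the sketch uses 4,096 small registers regardless of input size, and the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%. Raising `p` trades memory for accuracy.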

Massive datasets that are also highly partitioned or fragmented inherently show a long-tail distribution across their fragments. In the case study mentioned above, the data pipeline ingests terabytes of data every week; if we were to create a sketch for every dimension combination, we would end up with millions of fragments, each with its own sketch. We will explore methods to address the system-level storage and compute challenges while also achieving lower error rates.
