Wrangle, July 20

What is it like to have a real job in data science? How can we get a reasonable data forecast on messy data with no manual effort? Can ethics be guided with algorithms? Is it possible that governments can be more collaborative with open data? How can we improve data-driven financial institutions? What are the latest data science tools and experimentation coming out of Facebook, Airbnb, UCSF, Capital One, and Salesforce?

Wrangle is a one-day, single-track community event that hosts the best and brightest in the Bay Area talking about the principles, practice, and application of Data Science across multiple data-rich industries. Join Cloudera to discuss future trends, how they can be predicted, and, most importantly, how they can be anticipated.

"#WrangleConf was fantastic! By far the most practical/thought-provoking data science conference I've ever attended."



9:30am - 10:00am - What Would a CIA Data Scientist Do?

Drew Conway, CEO, Alluvium

Earlier this year I had the opportunity to speak to data scientists at the Central Intelligence Agency about the discipline of data science in 2017. That talk was equal parts technical and philosophical, but here I will focus on the latter. The discipline of data science has a very different feel today than it did a year ago. What was once the realm of the "unicorns" and "rock stars" is now a place of scorn and ridicule. This is particularly sensitive for data scientists in the intelligence community. I will discuss how the last year has changed our professional posture and its effect on those in the intelligence community, and ask the audience to consider their own work through this prism.

10:00am - 10:30am - The Ethics of Everybody Else

Tyler Schnoebelen, Principal Product Manager,

We aren’t surprised by facial recognition at security checkpoints. But how do you feel about face-scanning toilet roll dispensers? What if they don’t just find criminals but try to detect “criminality”? Laws and policies almost always lag technology so data scientists and machine learning experts are among the first line of ethical defense. The argument in this talk is that to be ethical, any system that classifies human beings has to consider the goals of the people affected by the system, not just the builders’ goals. This is not particularly convenient, but there are concrete ways to put goal-oriented design into practice. Doing so puts us in a better position to practice ethical behavior and attempt to address problems of power and the reproduction of inequality.

10:30am - 11:00am - Unlocking the Power of Social Chatter; Recent Endeavors @ Netflix

Sui Huang, Senior Data Scientist, Netflix

Netflix is making great strides in creating moments of joy by delivering high-quality content. Internally, we are pushing for a deeper understanding of how these moments, and the anticipation of them, manifest and travel online and offline through word of mouth, and what implications they have for our business. We have brought many machine learning techniques to bear on this space (e.g. NLP, causal ML). This talk will cover our recent endeavors in this area and how these studies will empower our business partners in their decision making throughout the company.

11:15am - 11:45am - Digital Government: Data + Government Isn't Enough

Trey Causey, Product Manager, Socrata

Government agencies are collecting and producing data at an accelerating rate, and constituents want access to this data with decreasing latency. Meeting a digitally savvy polity's desire for data while ensuring that data is open, accessible, and interpretable by all comes with unique challenges. I'll share some of these while walking through how governments are building their own data products using open data as well as empowering civic hackers. I'll also walk through why data science at the government level is fundamentally different than data science in the private sector.

11:45am - 12:15pm - Measurement with Intention

Sean Taylor, Research Scientist, Facebook

What we choose to measure has a profound impact on every decision we make, from our day-to-day personal habits to strategies for major corporations and governments. Metrics create a shared understanding of a problem, suggest paths toward solutions, and create or destroy incentives. With the proliferation of measurement technologies and data-driven decision making in the digital age, choosing the right concepts to measure and pay attention to may ultimately be the most important decision we make. I'll discuss what qualities good metrics have, how people decide what to measure in practice, and how innovations in measurement technologies have had dramatic impacts across a variety of domains.

Challenges in Building Data Science Products, Moderator: Clare Corthell


Clare Corthell, Data Product Manager, Clover Health


Derek Steer, CEO, Mode

Daniel Tunkelang, Consultant

Grant Ingersoll, CTO, Lucidworks

Chris Nicholson, CEO, Skymind

Matt Brandwein, Product Director, Cloudera

2:00pm - 2:30pm - The 'Joy' and Surprise of Healthcare Data

Jasmine Tsai, Data Platform Engineering Manager, Clover Health

Healthcare, like other industries with legacy systems, is full of data with particularly archaic and mysterious formats. It is also particularly hierarchical and networked, because of the nature of its systems (just think of what a hospital entails) and the complexity of the human body (this is not a joke). In this talk, we will walk through some salient features and the landscape of healthcare data, the particular challenges and rewards it presents when transforming it for use, and how a modern data system might approach it differently from its older counterparts.

2:30pm - 3:00pm - Building Robust Pipelines with Airflow

Erin Shellman, Senior Data Scientist, Zymergen

The data science team at Zymergen is applying machine learning techniques to identify genetic targets, work that is supported by extensive analytical automation that systematically identifies outliers, removes process-related bias, and quantifies performance improvements. We’re using Apache Airflow to construct robust data pipelines that allow us to produce clean, reliable inputs to our predictive models. In this talk, I’ll discuss the unique data processing challenges we face in working with high-throughput, biological data and provide an overview of how we’re using Apache Airflow to meet those challenges.

ETL Panel, Moderator: Josh Wills


Josh Wills, Head of Data Engineering, Slack

Panel Speakers:

Joe Hellerstein, Co-Founder & CSO, Trifacta

Jeff Magnusson, VP Data Platform, Stitch Fix

Maxime Beauchemin, Software Engineer, Airbnb

3:45pm - 4:15pm - Lessons From Integrating Machine Learning Models into Data Products

Sharath Rao, Data Scientist and Engineering Manager, Instacart

In this talk, I will share practical lessons and patterns for building machine learning (ML) models in production, based on our experience with search ranking and recommendation systems at Instacart. As part of this, I will include a detailed discussion of the technical challenges in building an ML feature pipeline, one of which is now shared across multiple data products at Instacart.

4:15pm - 4:45pm - Special Guest: Nellwyn Thomas, Deputy Chief Analytics Officer, Hillary Clinton Campaign

4:45pm - 5:15pm - Closing


