Optimizing Distributed Data Processing for ML at Scale

https://hackernoon.imgix.net/images/2jqChkrv03exBUgkLrDzIbfM99q2-nk0206s.jpeg

Most of the performance work I have done on ML data pipelines has not come from clever algorithms. It has come from removing waste wasted shuffles, wasted scans, wasted memory, wasted I/O. I find that genuinely a little disappointing when I look back on it, because the elegant solutions get all the credit and the boring optimizations do most of the work. This is a piece about the boring optimizations. The ones that consistently move the needle when I am asked to make a distributed ML workload faster, cheaper, or more reliable.

Start by Reading the Plan

I cannot overstate how much time I save by just looking at the query plan first. Most performance problems are visible in the plan before the job even runs.

I always start here.

EXPLAIN FORMATTEDSELECT u.user_id, u.signup_date, COUNT(DISTINCT e.event_id) AS num_events, AVG(e.amount) AS avg_amountFROM gold.users uLEFT JOIN gold.events e ON u.user_id = e.user_idWHERE...

Copyright of this story solely belongs to hackernoon.com. To see the full text click HERE