class: center, middle, inverse, title-slide

# Managing the Machine Learning Lifecycle with MLflow and R
## kevinykuo.com/talk/2018/10/sais-eu
### Kevin Kuo
@kevinykuo
### October 2018

---

# This talk

.Large[Repurposed from a **sparklyr**/NLP talk at the last minute.]

.Large[In case you're here for that, we'll give a 30-second update in a little bit!]

---

# About me

.Large[
- Software Engineer, Overhyped Technologies @RStudio
  - (big data, deep learning, AI, etc.)
- Options trader -> math grad student -> actuary -> management consultant -> data scientist -> software engineer
]

--

.Large[
- I also care about things that matter, like building tools to help R users put models into production.
]

--

.large[- I'm also studying for the Certified Sommelier exam 🍷]

---

# Structured streaming with Shiny integration

.pull-left[
```r
# Assumes sparklyr, dplyr, shiny and ggplot2 are attached
# and `sc` is an open Spark connection.

# Streaming word count, exposed to Shiny as a reactive
reactiveCount <- stream_read_text(sc, "source/") %>%
  ft_tokenizer("line", "tokens") %>%
  ft_stop_words_remover("tokens", "words") %>%
  transmute(words = explode(words)) %>%
  filter(nchar(words) > 0) %>%
  group_by(words) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  filter(n > 100) %>%
  reactiveSpark()

ui <- fluidPage(plotOutput("wordsPlot"))

server <- function(input, output) {
  # Redraw the top 10 words whenever the stream updates
  output$wordsPlot <- renderPlot({
    reactiveCount() %>%
      head(n = 10) %>%
      ggplot() +
      aes(x = words, y = n) +
      geom_bar(stat = "identity")
  })
}
```
]

.pull-right[
<img src="img/2018-10-01-sparklyr-shiny-app-books.gif" width="100%" />
]

---

# Monitoring and interrupting jobs

<img src="img/2018-10-01-sparklyr-monitored-connections.png" width="100%" />

Now you can hit `<STOP>` in the IDE without breaking your session!

---

# k8s

```r
sc <- spark_connect(config = spark_config_kubernetes("k8s://hostname:8443"))
```

.large[because scale, etc.]
--

.Large[
Blog post: [blog.rstudio.com/2018/10/01/sparklyr-0-9/](https://blog.rstudio.com/2018/10/01/sparklyr-0-9/)

Documentation: [spark.rstudio.com](http://spark.rstudio.com/)
]

--

.Large[
Also check out **parsnip** (successor to **caret**) integration: [github.com/topepo/parsnip](https://github.com/topepo/parsnip) (note: experimental)
]

---

class: center, inverse, middle

# Back to our regularly scheduled programme...

---

class: center, middle, inverse

# .Large[MOTIVATION]

---

class: center, middle

# `theme: productionization`

---

<img src="img/infinite_loop.png" width="90%" />

[youtu.be/-K9SjrWpeys](https://youtu.be/-K9SjrWpeys) [@josh_wills](https://twitter.com/josh_wills)

---

# Gross generalization

.Large[Data scientists using R primarily tend to be stats/natural sciences/social sciences types who picked up programming...]

--

<img src="img/hadoop-elephant.jpg" width="30%" />

.Large[`%in% room`]

--

.Large[...so they're clueless when it comes to software engineering principles and hardening models 🤷.]

---

# Python users, on the other hand...

--

.pull-right[
but really, we coo'
]

---

.pull-left[
Data scientist vs. ML engineer

<img src="img/chopper1.jpg" width="100%" />
]

--

.pull-right[
<img src="img/chopper2.jpg" width="75%" />
]

---

class: center, middle

Spark ML, xgboost, random CRAN packages, Keras, TensorFlow, scikit-learn, H2O, ...

---

class: center, middle

# `theme: reproducibility`

---

class: center, middle

[nature.com](https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970)

---

class: center, middle, inverse

# These problems had already been solved...

--

# for some.
---

.pull-left[
<img src="img/uber.png" width="80%" />

[eng.uber.com](https://eng.uber.com/michelangelo/#)
]

.pull-right[
<img src="img/fb.png" width="80%" />

[code.fb.com](https://code.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/)
]

---

class: middle, center, inverse

# But what if you're *not* a big tech company? 🤔

---

class: center, middle

.large[PMML? PFA? MLflow? New vendor in the exhibit hall?]

<img src="img/two_buttons.jpg" width="40%" />

---

# Efforts in the R ecosystem (excerpt)

.Large[
- *mleap*: MLeap integration for sparklyr for serializing Spark ML pipelines
- *tfruns*: Track and Visualize Training Runs (for TF and Keras)
- RStudio Connect: Native TF model deployment, arbitrary R models via plumber
- RStudio Connect: Reproducible report publishing and sharing
- **mlflow: interface to MLflow**
]

--

.Large[We likely won't ever solve everyone's problems with one framework, but we should be able to standardise on an approach for 90% of the problems and have good, generally accepted guidance on the rest.]

---

# MLflow

.Large[
- **Tracking:** keep track of your parameters, notes, and metrics for experiments.
- **Projects:** bundle your project and environment so others can reproduce your results.
- **Models:** serialize and package your scoring function for serving locally and in the cloud.
]

---

class: middle, center, inverse

# Demo time!

---

# Recap

- MLflow is cool and you should check it out. Share feedback and use cases!
- [github.com/mlflow/mlflow](https://github.com/mlflow/mlflow/tree/master/mlflow/R/mlflow). For R issues, tag `@kevinykuo` and `@javierluraschi`.
- The R package will be on CRAN soon.
- Link to this talk (incl. demo, to be uploaded after the talk): [kevinykuo.com/talk/2018/10/sais-eu/](https://kevinykuo.com/talk/2018/10/sais-eu/)