Big Data Application Development
Lecture 9 — Machine learning with MLlib
Dr. Yang Tang, NYU CSCI-GA.2437, Fall 2024
Agenda
- What is machine learning?
- Designing machine learning pipelines
- Hyperparameter tuning
What is machine learning?
"Machine learning: noun. The use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data." (Oxford Languages)
Broadly speaking, machine learning is a process for extracting patterns from your data, using statistics, linear algebra, and numerical optimization.
What is machine learning?
Examples
Machine learning can be applied to many problems:
- Predicting power consumption.
- Determining whether or not there is a cat in your video.
- Clustering items with similar characteristics.
What is machine learning?
Types of machine learning
There are a few types of machine learning:
- Supervised learning.
- Semi-supervised learning.
- Unsupervised learning.
- Reinforcement learning.
What is machine learning?
Supervised learning
In supervised machine learning, your data consists of a set of input records, each with an associated label.
The goal is to predict the output label(s) given a new unlabeled input. These output labels can either be discrete or continuous.
There are two types of supervised machine learning: classification and regression.
Supervised learning
Classification
In a classification problem, the aim is to separate the inputs into a discrete set of classes or labels:
- Binary classification.
- Multiclass (multinomial) classification.
Supervised learning
Classification
Binary classification: there are two discrete labels you want to predict.
Example: dog or not dog.
Supervised learning
Classification
Multiclass (multinomial) classification: there can be three or more discrete labels you want to predict.
Example: predicting the breed of a dog (Australian shepherd, golden retriever, poodle).
Supervised learning
Regression
In regression problems, the value to predict is a continuous number, not a label.
You might predict values that your model hasn't seen during training.
Example: build a model to predict the daily ice cream sales given the temperature.
Your model might predict the value $77.67, even if none of the input/output pairs it was trained on contained that value.
Supervised learning
Commonly used supervised ML algorithms in Spark MLlib

Algorithm                         Typical usage
Linear regression                 Regression
Logistic regression               Classification (I know, it has "regression" in the name!)
Decision trees                    Both
Gradient boosted trees            Both
Random forests                    Both
Naive Bayes                       Classification
Support vector machines (SVMs)    Classification
What is machine learning?
Unsupervised learning
Obtaining the labeled data required by supervised machine learning can be very expensive and/or infeasible.
This is where unsupervised machine learning comes into play.
Instead of predicting a label, unsupervised machine learning helps you to better understand the structure of your data.
Unsupervised learning
Example
There is no known true label for each of these data points (x₁, x₂).
However, by applying unsupervised machine learning to our data, we can find the clusters that naturally form.
Unsupervised learning
Usage scenarios
Unsupervised machine learning can be used:
- For outlier detection.
- As a preprocessing step for supervised machine learning.
Example: reduce the dimensionality (i.e., number of dimensions per datum) of the dataset, which is useful for reducing storage requirements or simplifying downstream tasks.
Unsupervised learning
Commonly used unsupervised ML algorithms in Spark MLlib
Here are some unsupervised machine learning algorithms in MLlib (see the sketch after this list):
- k-means.
- Latent Dirichlet Allocation (LDA).
- Gaussian mixture models.
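For instance, k-means fits the same estimator pattern used throughout this lecture. A minimal sketch, assuming a hypothetical DataFrame featureDF that already has a "features" vector column (not from the lecture):

import org.apache.spark.ml.clustering.KMeans

// Fit a 3-cluster k-means model; "features" is the default input column.
val kmeans = new KMeans().setK(3).setSeed(42)
val kmModel = kmeans.fit(featureDF)
// transform() appends a "prediction" column with each point's cluster ID.
val clusteredDF = kmModel.transform(featureDF)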
Why Spark for machine learning?
Spark is a unified analytics engine that provides an ecosystem for data ingestion, feature engineering, model training, and deployment.
Without Spark, developers would need many disparate tools to accomplish this set of tasks, and might still struggle with scalability.
Why Spark for machine learning?
Spark has two machine learning packages: spark.mllib and spark.ml.
- spark.mllib is the original machine learning API, based on RDDs.
- spark.ml is the newer API, based on DataFrames.
Why Spark for machine learning?
With spark.ml, you can use one ecosystem for your data preparation and model building, without the need to downsample your data to fit on a single machine.
spark.ml focuses on scale-out: the model scales linearly, O(n), with the number of data points you have, so it can scale to massive amounts of data.
If you have previously used Python scikit-learn, many of the APIs in spark.ml will feel quite familiar, but there are some subtle differences.
Agenda
- What is machine learning?
- Designing machine learning pipelines
- Hyperparameter tuning
Designing machine learning pipelines
The concept of pipelines is common across many ML frameworks as a way to organize a series of operations to apply to your data.
In MLlib, the Pipeline API provides a high-level API built on top of DataFrames to organize your machine learning workflow.
The Pipeline API is composed of a series of transformers and estimators.
Designing machine learning pipelines
MLlib terminology
Transformer: accepts a DataFrame as input and returns a new DataFrame with one or more columns appended to it.
Transformers do not learn any parameters from your data. They simply apply rule-based transformations in order to:
- Prepare data for model training.
- Generate predictions using a trained MLlib model.
Transformers have a .transform() method.
Designing machine learning pipelines
MLlib terminology
Estimator: learns (or "fits") parameters from your DataFrame and returns a Model.
The returned Model is a transformer.
Estimators have a .fit() method.
Designing machine learning pipelines
MLlib terminology
Pipeline: organizes a series of transformers and estimators into a single model.
Pipelines themselves are estimators: pipeline.fit() returns a PipelineModel, which is a transformer.
Designing machine learning pipelines
The dataset
Today, we will use the San Francisco housing dataset from Inside Airbnb.
It contains information about Airbnb rentals in San Francisco, such as:
- The number of bedrooms.
- Location.
- Review scores.
Goal: build a model to predict the nightly rental prices for listings in that city.
This is a regression problem, because price is a continuous variable.
(Slide: screenshot of the dataset.)
Designing machine learning pipelines
Data ingestion and exploration
Let's take a quick peek at the dataset and the corresponding schema:

scala> val rawDF = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .option("inferSchema", "true")
  .option("escape", "\"")
  .csv("sf-airbnb.csv")
rawDF: org.apache.spark.sql.DataFrame = [id: int, listing_url: string ... 104 more fields]

scala> rawDF.select("neighbourhood_cleansed", "room_type", "bedrooms", "bathrooms", "number_of_reviews", "price").show(5)
+----------------------+---------------+--------+---------+-----------------+-------+
|neighbourhood_cleansed|      room_type|bedrooms|bathrooms|number_of_reviews|  price|
+----------------------+---------------+--------+---------+-----------------+-------+
|      Western Addition|Entire home/apt|       1|      1.0|              180|$170.00|
|        Bernal Heights|Entire home/apt|       2|      1.0|              111|$235.00|
|        Haight Ashbury|   Private room|       1|      4.0|               17| $65.00|
|        Haight Ashbury|   Private room|       1|      4.0|                8| $65.00|
|      Western Addition|Entire home/apt|       2|      1.5|               27|$785.00|
+----------------------+---------------+--------+---------+-----------------+-------+
Designing machine learning pipelines
Data ingestion and exploration
This dataset is quite messy and can be difficult to model. Like most real-world datasets!
- It contains 100+ fields (i.e., columns)!
- It contains missing data and outliers!
Designing machine learning pipelines
Data ingestion and exploration
So we need to do some exploratory data analysis and cleansing (a sketch follows below):
- Select an informative subset of the fields.
- Convert all integers and numerical strings to doubles.
- Remove outliers. Example: listings posted for $0/night.
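A minimal sketch of what this cleansing might look like; the column subset and the baseDF/fixedPriceDF/cleanDF names are illustrative, not from the lecture:

import org.apache.spark.sql.functions.{col, translate}

// Keep an informative subset of the 100+ fields.
val baseDF = rawDF.select(
  "neighbourhood_cleansed", "room_type", "bedrooms", "bathrooms",
  "number_of_reviews", "price")

// Convert the "price" string (e.g., "$170.00") to a double by stripping
// "$" and "," before casting.
val fixedPriceDF = baseDF.withColumn(
  "price", translate(col("price"), "$,", "").cast("double"))

// Drop rows with missing values and remove $0/night outliers.
val cleanDF = fixedPriceDF.na.drop().filter(col("price") > 0)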
Designing machine learning pipelines
Creating training and test datasets
Before we begin feature engineering and modeling, we will divide our dataset into two groups: train and test.
Depending on the size of your dataset, your train/test ratio may vary. Many people use 80/20 as a standard train/test split.
Designing machine learning pipelines
Creating training and test datasets
Why not use the entire dataset to train the model? If we built a model on the entire dataset, it's possible that the model would "overfit" to the training data we provided, and we would have no more data with which to evaluate how well it generalizes to previously unseen data.
The model's performance on the test set is a proxy for how well it will perform on unseen data (i.e., in the wild or in production), assuming that data follows similar distributions.
Designing machine learning pipelines
Creating training and test datasets
n — the number of data points (a.k.a. examples).
d — the number of features (a.k.a. fields or columns).
For every example, there is one label.
Designing machine learning pipelines
Creating training and test datasets
Different metrics are used to measure the performance of the model.
For classification problems, a standard metric is the accuracy (i.e., percentage) of correct predictions.
Once the model has satisfactory performance on the training set using that metric, we will apply the model to our test set.
If it performs well on our test set according to our evaluation metrics, then we can feel confident that we have built a model that will "generalize" to unseen data.
Designing machine learning pipelines
Creating training and test datasets
When creating training and test datasets, we will set a random seed for reproducibility.
If we rerun this code, we will get the same data points going to our train and test datasets, respectively.
Question: is fixing the random seed sufficient for reproducibility?

val Array(trainDF, testDF) = airbnbDF.randomSplit(Array(.8, .2), seed=42)
Designing machine learning pipelines
Creating training and test datasets
What happens if we change the number of executors in our Spark cluster? The Catalyst optimizer determines the optimal way to partition your data as a function of your cluster resources and the size of your dataset.
We know that:
- Data in a Spark DataFrame is row-partitioned.
- Each worker performs its split independently of the other workers.
If the data in the partitions changes, the result of the split (by randomSplit()) won't be the same.
Designing machine learning pipelines
Creating training and test datasets
Although you could fix your cluster configuration and your seed to ensure that you get consistent results, our recommendation is:
- Split your data once.
- Write it out to its own train/test folder (a sketch follows below).
In this way, you won't have these reproducibility issues.
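One way to persist the split; the output paths and the choice of Parquet are illustrative assumptions:

// Write each split out once; later runs read these folders back in
// instead of re-splitting.
trainDF.write.mode("overwrite").parquet("/data/sf-airbnb/train")
testDF.write.mode("overwrite").parquet("/data/sf-airbnb/test")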
Designing machine learning pipelines
Creating training and test datasets
During your exploratory analysis, you should cache the training set, because you will be accessing it many times throughout the machine learning process.
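In code, this is a one-liner; the count() to materialize the cache is an optional, illustrative touch:

trainDF.cache()
trainDF.count()  // an action forces the cache to be materialized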
Designing machine learning pipelines
Preparing features with transformers
Let's prepare the data to build a linear regression model predicting price given the number of bedrooms.
Linear regression (like many other algorithms in Spark) requires that all the input features are contained within a single vector in your DataFrame.
Thus, we need to transform our data.
Designing machine learning pipelines
Preparing features with transformers
Transformers accept a DataFrame as input and return a new DataFrame with one or more columns appended to it.
They do not learn from your data, but apply rule-based transformations using the transform() method.
Designing machine learning pipelines
Preparing features with transformers
We will use the VectorAssembler transformer to put all of our features into a single vector.
It takes a list of input columns and creates a new DataFrame with an additional column, which we will call features.
For now, let's just use the "bedrooms" column.

import org.apache.spark.ml.feature.VectorAssembler

val vecAssembler = new VectorAssembler()
  .setInputCols(Array("bedrooms"))
  .setOutputCol("features")
val vecTrainDF = vecAssembler.transform(trainDF)
Designing machine learning pipelines
Preparing features with transformers

scala> vecTrainDF.select("bedrooms", "features", "price").show(10)
+--------+--------+-----+
|bedrooms|features|price|
+--------+--------+-----+
|     1.0|   [1.0]|200.0|
|     1.0|   [1.0]|130.0|
|     1.0|   [1.0]| 95.0|
|     1.0|   [1.0]|250.0|
|     3.0|   [3.0]|250.0|
|     1.0|   [1.0]|115.0|
|     1.0|   [1.0]|105.0|
|     1.0|   [1.0]| 86.0|
|     1.0|   [1.0]|100.0|
|     2.0|   [2.0]|220.0|
+--------+--------+-----+
Designing machine learning pipelines
Understanding linear regression
Linear regression models a linear relationship between your dependent variable (i.e., label) and one or more independent variables (i.e., features).
In our case, we want to fit a linear regression model to predict the price of an Airbnb rental given the number of bedrooms.
Designing machine learning pipelines
Understanding linear regression
In this figure:
- Dots — true (x, y) pairs from our dataset.
- Line — the line of best fit for this dataset.
- Vertical bars — errors (a.k.a. residuals) between our model predictions and the true values.
Designing machine learning pipelines
Understanding linear regression
The process of estimating the coefficients and intercept for our model is called learning (or fitting) the parameters for the model.
We can think of linear regression as fitting a model to y = mx + b + ε.
The goal of linear regression is to find a line that minimizes the square of these residuals.
The line can extrapolate predictions for data points it hasn't seen.
Designing machine learning pipelines
Understanding linear regression
Linear regression can also be extended to handle multiple independent variables.
If we had three features as input, x = [x₁, x₂, x₃], we could model y as y = w₀ + w₁x₁ + w₂x₂ + w₃x₃ + ε.
Designing machine learning pipelines
Using estimators to build models
After setting up our VectorAssembler, we have our data prepared and transformed into a format that our linear regression model expects.
In Spark, LinearRegression is a type of estimator: it takes in a DataFrame and returns a Model.
Estimators learn parameters from your data using the fit() method. They are eagerly evaluated (i.e., kick off Spark jobs immediately). By contrast, transformers are lazily evaluated.
Designing machine learning pipelines
Using estimators to build models
Examples of estimators:
- Imputer
- LinearRegression
- DecisionTreeClassifier
- RandomForestRegressor
Designing machine learning pipelines
Using estimators to build models
The input column for LinearRegression (features) is the output from our VectorAssembler.
lr.fit() returns a LinearRegressionModel, which is a transformer.
Once the estimator has learned the parameters, the transformer can apply these parameters to new data points to generate predictions.

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("price")
val lrModel = lr.fit(vecTrainDF)
Designing machine learning pipelines
Using estimators to build models
Let's inspect the parameters it learned:

val m = lrModel.coefficients(0)
val b = lrModel.intercept

scala> println(f"""The formula for the linear regression line is price = $m%1.2f*bedrooms + $b%1.2f""")
The formula for the linear regression line is price = 123.68*bedrooms + 47.51
Designing machine learning pipelines
Creating a pipeline
If we want to apply our model to our test set, then we need to prepare that data in the same way as the training set (i.e., pass it through the vector assembler).
Oftentimes, data preparation pipelines will have multiple steps, and it becomes cumbersome to remember not only which steps to apply, but also the ordering of the steps.
This is the motivation for the Pipeline API: you simply specify the stages you want your data to pass through, in order, and Spark takes care of the processing for you.
The Pipeline API provides the user with better code reusability and organization.
Designing machine learning pipelines
Creating a pipeline
In Spark, Pipelines are estimators, whereas PipelineModels (i.e., fitted Pipelines) are transformers.
Another advantage of using the Pipeline API is that it determines which stages are estimators/transformers for you, so you don't have to worry about specifying .fit() versus .transform() for each of the stages.
Let's build our pipeline now:

import org.apache.spark.ml.Pipeline

val pipeline = new Pipeline().setStages(Array(vecAssembler, lr))
val pipelineModel = pipeline.fit(trainDF)
Designing machine learning pipelines
Creating a pipeline
Since pipelineModel is a transformer, let's apply it to our test set, too:

val predDF = pipelineModel.transform(testDF)

scala> predDF.select("bedrooms", "features", "price", "prediction").show(10)
+--------+--------+------+------------------+
|bedrooms|features| price|        prediction|
+--------+--------+------+------------------+
|     1.0|   [1.0]|  85.0|171.18598011578285|
|     1.0|   [1.0]|  45.0|171.18598011578285|
|     1.0|   [1.0]|  70.0|171.18598011578285|
|     1.0|   [1.0]| 128.0|171.18598011578285|
|     1.0|   [1.0]| 159.0|171.18598011578285|
|     2.0|   [2.0]| 250.0|294.86172649777757|
|     1.0|   [1.0]|  99.0|171.18598011578285|
|     1.0|   [1.0]|  95.0|171.18598011578285|
|     1.0|   [1.0]| 100.0|171.18598011578285|
|     1.0|   [1.0]|2010.0|171.18598011578285|
+--------+--------+------+------------------+
Designing machine learning pipelines
Creating a pipeline
In this example, we built a model using only a single feature, "bedrooms." However, you may want to build a model using all of your features.
Some features may be categorical, such as "host_is_superhost." Categorical features take on discrete values and have no intrinsic ordering. Examples: occupations, country names, …
Let's consider a solution for how to treat these kinds of variables.
Designing machine learning pipelines
One-hot encoding
Most machine learning models in MLlib expect numerical values as input, represented as vectors.
To convert categorical values into numeric values, we can use a technique called one-hot encoding (OHE).
Designing machine learning pipelines
One-hot encoding
Suppose we have a column called Animal and we have three types of animals: Dog, Cat, and Fish.
We can't pass the string types into our ML model directly, so we need to assign a numeric mapping, such as this:

Animal = {"Dog", "Cat", "Fish"}
"Dog" = 1, "Cat" = 2, "Fish" = 3

Question: any issues with this approach?
Designing machine learning pipelines
One-hot encoding
Using this approach, we've introduced some spurious relationships into our dataset that weren't there before. Example: why did we assign Cat twice the value of Dog?
The numeric values we use should not introduce any relationships into our dataset. Instead, we want to create a separate column for every distinct value in our Animal column (the ordering of the columns is irrelevant):

"Dog"  = [1, 0, 0]
"Cat"  = [0, 1, 0]
"Fish" = [0, 0, 1]
Designing machine learning pipelines
One-hot encoding
You might be wondering: if we had a zoo of 300 animals, would OHE massively increase consumption of memory/compute resources?
Not with Spark! Spark internally uses a SparseVector when the majority of the entries are 0, so it does not waste space storing 0 values.
Example: the following two vectors represent the same data.

DenseVector(0, 0, 0, 7, 0, 2, 0, 0, 0, 0)
SparseVector(10, [3, 5], [7, 2])
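You can construct both forms directly; a minimal sketch using MLlib's Vectors factory:

import org.apache.spark.ml.linalg.Vectors

// Dense: stores all 10 entries, including the zeros.
val dense = Vectors.dense(0, 0, 0, 7, 0, 2, 0, 0, 0, 0)
// Sparse: stores the size (10), the nonzero indices (3, 5), and the
// nonzero values (7.0, 2.0).
val sparse = Vectors.sparse(10, Array(3, 5), Array(7.0, 2.0))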
Designing machine learning pipelines
One-hot encoding
There are a few ways to one-hot encode your data with Spark:
- A common approach is to use the StringIndexer and OneHotEncoder.
- Another approach is to use RFormula.
Let's look at the first approach.
Designing machine learning pipelines
One-hot encoding
First, apply the StringIndexer estimator to convert categorical values into category indices.
These category indices are ordered by label frequencies: the most frequent label gets index 0. This gives us reproducible results across various runs of the same data.
Next, you can pass those as input to the OneHotEncoder, which maps a column of category indices to a column of binary vectors.
Designing machine learning pipelines
One-hot encoding
Note: there are some differences in the StringIndexer and OneHotEncoder APIs from Spark 2.3/2.4 to 3.0.

                          Spark 2.3 and 2.4                   Spark 3.0
StringIndexer             Single column as input/output       Multiple columns as input/output
OneHotEncoder             Deprecated                          Multiple columns as input/output
OneHotEncoderEstimator    Multiple columns as input/output    Not available
Designing machine learning pipelines
One-hot encoding

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val categoricalCols = trainDF.dtypes.filter(_._2 == "StringType").map(_._1)
val indexOutputCols = categoricalCols.map(_ + "_index")
val oheOutputCols = categoricalCols.map(_ + "_OHE")

val stringIndexer = new StringIndexer()
  .setInputCols(categoricalCols)
  .setOutputCols(indexOutputCols)
  .setHandleInvalid("skip")

val oheEncoder = new OneHotEncoder()
  .setInputCols(indexOutputCols)
  .setOutputCols(oheOutputCols)

val numericCols = trainDF.dtypes.filter { case (field, dataType) =>
  dataType == "DoubleType" && field != "price"
}.map(_._1)

val assemblerInputs = oheOutputCols ++ numericCols
val vecAssembler = new VectorAssembler()
  .setInputCols(assemblerInputs)
  .setOutputCol("features")
Designing machine learning pipelines
One-hot encoding
You might be wondering: how does the StringIndexer handle new categories that appear in the test set, but not in the training set?
There is a handleInvalid parameter that specifies how you want to handle them. The options are:
- skip — filter out rows with invalid data.
- error — throw an error.
- keep — put invalid data in a special additional bucket.
Designing machine learning pipelines
One-hot encoding
Another approach to one-hot encode your data is to use RFormula.
The syntax for this is inspired by the R programming language. See https://spark.apache.org/docs/latest/ml-features.html#rformula.
You provide your label and which features you want to include.

import org.apache.spark.ml.feature.RFormula

val rFormula = new RFormula()
  .setFormula("price ~ .")
  .setFeaturesCol("features")
  .setLabelCol("price")
  .setHandleInvalid("skip")
Designing machine learning pipelines
One-hot encoding
Under the hood, RFormula will automatically:
- StringIndex and OHE all of your string columns.
- Convert your numeric columns to double type.
- Combine all of these into a single vector using VectorAssembler.
Designing machine learning pipelines
One-hot encoding
The downside of RFormula automatically combining the StringIndexer and OneHotEncoder is that one-hot encoding is not required or recommended for all algorithms.
Example: tree-based algorithms can handle categorical variables directly if you just use the StringIndexer for the categorical features.
Unfortunately, there is no one-size-fits-all solution for feature engineering, and the ideal approach is closely related to the downstream algorithms you plan to apply to your dataset.
If someone else performs the feature engineering for you, make sure they document how they generated those features.
Designing machine learning pipelines
One-hot encoding
Once you've written the code to transform your dataset, you can add a linear regression model using all of the features as input.
Let's put all the feature preparation and model building into the pipeline:

val lr = new LinearRegression()
  .setLabelCol("price")
  .setFeaturesCol("features")

val pipeline = new Pipeline()
  .setStages(Array(stringIndexer, oheEncoder, vecAssembler, lr))

// Or use RFormula
// val pipeline = new Pipeline().setStages(Array(rFormula, lr))
Designing machine learning pipelines
One-hot encoding
Let's apply the pipeline to our dataset:

val pipelineModel = pipeline.fit(trainDF)
val predDF = pipelineModel.transform(testDF)

scala> predDF.select("features", "price", "prediction").show(5)
+--------------------+-----+------------------+
|            features|price|        prediction|
+--------------------+-----+------------------+
|(98,[0,3,6,7,23,4...| 85.0| 55.80250714362137|
|(98,[0,3,6,7,23,4...| 45.0| 22.74720286761658|
|(98,[0,3,6,7,23,4...| 70.0|27.115811183814913|
|(98,[0,3,6,7,13,4...|128.0|-91.60763412465076|
|(98,[0,3,6,7,13,4...|159.0| 94.70374072351933|
+--------------------+-----+------------------+
Designing machine learning pipelines
One-hot encoding
How is our model performing? You can see that while some of the predictions might be considered "close," others are far off (a negative price for a rental!?).
Next, we'll numerically evaluate how well our model performs across our entire test set.
Designing machine learning pipelines
Evaluating models
In spark.ml, there are classification, regression, clustering, and ranking evaluators.
Since this is a regression problem, we will use root-mean-square error (RMSE) and R² to evaluate our model's performance.
Designing machine learning pipelines
Evaluating models — RMSE
RMSE (root-mean-square error) is a metric that ranges from 0 to +∞. The closer it is to zero, the better.
Mathematically, it's defined as:

RMSE = √( (1/n) · Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)² )

- yᵢ is the true value.
- ŷᵢ is the predicted value.
Designing machine learning pipelines
Evaluating models — RMSE
Let's evaluate our model using RMSE:

import org.apache.spark.ml.evaluation.RegressionEvaluator

val regressionEvaluator = new RegressionEvaluator()
  .setPredictionCol("prediction")
  .setLabelCol("price")
  .setMetricName("rmse")
val rmse = regressionEvaluator.evaluate(predDF)

scala> println(f"RMSE is $rmse%.1f")
RMSE is 220.6
Designing machine learning pipelines
Evaluating models — RMSE
So how do we know if 220.6 is a good value for the RMSE? There are various ways to interpret this value.
One way is to build a simple baseline model and compute its RMSE to compare against (see the sketch below).
A common baseline model for regression tasks is to compute the average value of the label on the training set, ȳ, then predict ȳ for every record in the test set and compute the resulting RMSE.
If you don't beat the baseline, then something probably went wrong in your model building process.
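A minimal sketch of that baseline; the avgPrice/baselineDF names are illustrative:

import org.apache.spark.sql.functions.{avg, lit}

// Average price on the training set.
val avgPrice = trainDF.select(avg("price")).first().getDouble(0)
// Predict that constant for every test record, then reuse the evaluator.
val baselineDF = testDF.withColumn("prediction", lit(avgPrice))
val baselineRMSE = regressionEvaluator.setMetricName("rmse").evaluate(baselineDF)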
Designing machine learning pipelines
Evaluating models — RMSE
Keep in mind that the unit of your label directly impacts your RMSE.
Example: if your label is height, then your RMSE will be higher if you use inches rather than feet as your unit of measurement.
You could arbitrarily decrease the RMSE by using a different unit, so the raw value of RMSE is meaningless. It is important to compare your RMSE against a baseline.
Designing machine learning pipelines
Evaluating models — R²
R² values range from −∞ to 1. Don't be confused by the "squared" in its name.
Mathematically, it's defined as:

R² = 1 − Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)² / Σᵢ₌₁ⁿ (yᵢ − ȳ)²

- yᵢ is the true value.
- ŷᵢ is the predicted value.
Designing machine learning pipelines
Evaluating models — R²
- If your model perfectly predicts every data point, your R² = 1.
- If your model performs the same as always predicting the average value ȳ, your R² = 0.
- If your model performs worse than always predicting ȳ, your R² is negative. You should reëvaluate your modeling process.
The nice thing about using R² is that you don't necessarily need to define a baseline model to compare against.
Designing machine learning pipelines
Evaluating models — R²
If we want to change our regression evaluator to use R², instead of redefining the regression evaluator, we can set the metric name using the setter property:

val r2 = regressionEvaluator.setMetricName("r2").evaluate(predDF)

scala> println(s"R2 is $r2")
R2 is 0.159854
Designing machine learning pipelines
Evaluating models — R²
Our R² is positive, but it's very close to 0.
One of the reasons why our model is not performing too well is that our label, "price," appears to be log-normally distributed: if we take the logarithm of the value, the result looks like a normal distribution.
Price is often log-normally distributed. If you think about rental prices in San Francisco, most cost around $200 per night, but there are some that rent for thousands of dollars a night!
Designing machine learning pipelines
Evaluating models — R²
Exercise: build a model to predict price on the log scale, then exponentiate the prediction to get it out of log scale and evaluate your model.
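One possible shape of a solution, not the official one; the log_price and log_pred column names are illustrative:

import org.apache.spark.sql.functions.{col, exp, log}

// Train on log(price) instead of price.
val logTrainDF = trainDF.withColumn("log_price", log(col("price")))
val logTestDF = testDF.withColumn("log_price", log(col("price")))

val logLR = new LinearRegression()
  .setLabelCol("log_price")
  .setPredictionCol("log_pred")
val logPipeline = new Pipeline()
  .setStages(Array(stringIndexer, oheEncoder, vecAssembler, logLR))

// Exponentiate to get predictions back on the original price scale,
// then evaluate against the raw "price" label as before.
val logPredDF = logPipeline.fit(logTrainDF)
  .transform(logTestDF)
  .withColumn("prediction", exp(col("log_pred")))
val logRMSE = regressionEvaluator.setMetricName("rmse").evaluate(logPredDF)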
Designing machine learning pipelines
Saving and loading models
Now that we have built and evaluated a model, let's save it to persistent storage for reuse later.
In the event that our cluster goes down, we don't have to recompute the model.
Designing machine learning pipelines
Saving and loading models
Saving models is very similar to writing DataFrames: the API is model.write.save(path).
You can optionally provide the overwrite() command to overwrite any data contained in that path.

val pipelinePath = "/tmp/lr-pipeline-model"
pipelineModel.write.overwrite().save(pipelinePath)
Designing machine learning pipelines
Saving and loading models
When you load your saved models, you need to specify the type of model you are loading back in.
For this reason, it is recommended that you always put your transformers/estimators into a Pipeline, so that you always load a PipelineModel.

import org.apache.spark.ml.PipelineModel

val savedPipelineModel = PipelineModel.load(pipelinePath)
Agenda
- What is machine learning?
- Designing machine learning pipelines
- Hyperparameter tuning
Hyperparameter tuning
A hyperparameter is an attribute that you define about the model prior to training. It controls the learning process or structure of your model, and it's not learned during the training process.
By contrast, parameters are learned in the training process.
Example: hyperparameters of random forests:
- The max depth of a decision tree.
- The number of trees in a random forest.
Hyperparameter tuning
Training, validation, and test datasets
When people talk about tuning their models, they often discuss tuning hyperparameters to improve the model's predictive power.
Which dataset should we use to determine the optimal hyperparameter values?
- If we use the training set, then the model is likely to overfit, or memorize the nuances of our training data.
- If we use the test set, then it will no longer represent "unseen" data, so we won't be able to use it to verify how well our model generalizes.
Thus, we need another dataset to help us determine the optimal hyperparameters: the validation dataset.
Hyperparameter tuning
Training, validation, and test datasets
Instead of splitting our data into an 80/20 train/test split, we can do a 60/20/20 split to generate training, validation, and test datasets, respectively (see the sketch below). Then, we can:
- Build our model on the training set.
- Evaluate performance on the validation set to select the best hyperparameter configuration.
- Apply the model to the test set to see how well it performs on new data.
However, we lose 25% of our training data (80% → 60%), which could have been used to help improve the model.
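randomSplit() accepts any number of weights, so a three-way split is a one-liner; the valDF name is illustrative:

val Array(trainDF, valDF, testDF) =
  airbnbDF.randomSplit(Array(.6, .2, .2), seed = 42)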
Hyperparameter tuning
k-fold cross-validation
We can use the k-fold cross-validation technique to solve this problem.
Instead of splitting the dataset into separate training, validation, and test sets, we split it into training and test sets as before, but we use the training data for both training and validation.
Hyperparameter tuning
k-fold cross-validation
First, we split our training data into k subsets ("folds").
For a given hyperparameter configuration, we:
- Repeat the following process k times:
  - Train our model on k − 1 folds.
  - Evaluate it on the remaining fold.
- Average the performance across those k validation sets. This average is used as a proxy of how well this model will perform on unseen data.
We repeat this process for all of our different hyperparameter configurations to identify the optimal one.
Hyperparameter tuning
k-fold cross-validation
To perform a hyperparameter search in Spark, take the following steps:
1. Define the estimator you want to evaluate.
2. Specify which hyperparameters you want to vary, as well as their respective values, using the ParamGridBuilder.
3. Define an evaluator to specify which metric to use to compare the various models.
4. Use the CrossValidator to perform cross-validation, evaluating each of the various models.
Hyperparameter tuning
k-fold cross-validation
Example: suppose we want to build a random forest model to predict the price.
We want to tune two hyperparameters:
- The max depth of the decision trees.
- The number of trees in our random forest.

import org.apache.spark.ml.regression.RandomForestRegressor

val rf = new RandomForestRegressor()
  .setLabelCol("price")
  .setMaxBins(40)
  .setSeed(42)

(Diagram: the individual tree predictions Price-1, Price-2, …, Price-n are averaged into the final price.)
Hyperparameter tuning
k-fold cross-validation
Step 1: define our pipeline estimator:

val pipeline = new Pipeline()
  .setStages(Array(stringIndexer, vecAssembler, rf))
Hyperparameter tuning
k-fold cross-validation
Step 2: for our ParamGridBuilder, let's vary:
- maxDepth — 2, 4, 6.
- numTrees — 10, 100.
This will give us a grid of 6 different hyperparameter configurations in total.

import org.apache.spark.ml.tuning.ParamGridBuilder

val paramGrid = new ParamGridBuilder()
  .addGrid(rf.maxDepth, Array(2, 4, 6))
  .addGrid(rf.numTrees, Array(10, 100))
  .build()
Hyperparameter tuning
k-fold cross-validation
Step 3: define how to evaluate each of the models to determine which one performed best.
For this task, let's use the RegressionEvaluator, and let's use RMSE as our metric of interest:

val evaluator = new RegressionEvaluator()
  .setLabelCol("price")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
Hyperparameter tuning
k-fold cross-validation
Step 4: perform our k-fold cross-validation using the CrossValidator with:
- estimator — which model to use.
- evaluator — how to evaluate the model.
- estimatorParamMaps — which hyperparameters to set for the model.
Let's fit this cross-validator to our training set:

import org.apache.spark.ml.tuning.CrossValidator

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setSeed(42)
val cvModel = cv.fit(trainDF)
Hyperparameter tuning
k-fold cross-validation
To inspect the results of the cross-validator, take a look at the avgMetrics:

scala> cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics)
res0: Array[(org.apache.spark.ml.param.ParamMap, Double)] = Array(
  ({ rfr_a132fb1ab6c8-maxDepth: 2, rfr_a132fb1ab6c8-numTrees: 10 }, 303.99522869739343),
  ({ rfr_a132fb1ab6c8-maxDepth: 2, rfr_a132fb1ab6c8-numTrees: 100 }, 299.56501993529474),
  ({ rfr_a132fb1ab6c8-maxDepth: 4, rfr_a132fb1ab6c8-numTrees: 10 }, 310.63687030886894),
  ({ rfr_a132fb1ab6c8-maxDepth: 4, rfr_a132fb1ab6c8-numTrees: 100 }, 294.7369599168999),
  ({ rfr_a132fb1ab6c8-maxDepth: 6, rfr_a132fb1ab6c8-numTrees: 10 }, 312.6678169109293),
  ({ rfr_a132fb1ab6c8-maxDepth: 6, rfr_a132fb1ab6c8-numTrees: 100 }, 292.101039874209))
Hyperparameter tuning
Optimizing pipelines
Although each of the models in the cross-validator is independent, spark.ml trains the collection of models sequentially rather than in parallel.
In Spark 2.3, a parallelism parameter was introduced to solve this problem. It determines the maximum level of parallelism used to evaluate models in parallel.
"The value of parallelism should be chosen carefully to maximize parallelism without exceeding cluster resources, and larger values may not always lead to improved performance. Generally speaking, a value up to 10 should be sufficient for most clusters." — ML Tuning: model selection and hyperparameter tuning
Hyperparameter tuning
Optimizing pipelines
Let's set this value to 4 and see if we can train any faster:

val cvModel = cv.setParallelism(4).fit(trainDF)
Hyperparameter tuning
Optimizing pipelines
There's another trick we can use to speed up model training: we can put the cross-validator inside the pipeline instead of putting the pipeline inside the cross-validator.
Every time the cross-validator evaluates the pipeline, it runs through every step of the pipeline for each model, even if some of the steps don't change, such as the StringIndexer.
If instead we put our cross-validator inside our pipeline, then we won't be reëvaluating the StringIndexer (or any other estimator) each time we try a different model.
Hyperparameter tuning
Optimizing pipelines
Let's put our cross-validator inside our pipeline:

val cv = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setParallelism(4)
  .setSeed(42)

val pipeline = new Pipeline()
  .setStages(Array(stringIndexer, vecAssembler, cv))
val pipelineModel = pipeline.fit(trainDF)
Summary
We have learned how to build machine learning pipelines using Spark MLlib:
- How to explore and clean a real-world dataset.
- The differences between transformers and estimators.
- How to compose transformers and estimators using the Pipeline API.
- How to evaluate models.
- How to use cross-validation to perform hyperparameter tuning.
- How to optimize cross-validation and model training.