CSCI-GA.2437 Big Data Application Development
Midterm Exam
Spring 2024
March 12, 2024, 7:10 p.m. – 9:10 p.m.

Name: ____________    NetID: ____________

This is a closed-book exam. You are allowed to bring a double-sided 8.5" × 11" cheat sheet.

Advice

Please manage your 120 minutes wisely. The following is a suggested time allotment.

  Question   1        2        3        4        5        Extra      Total
  Points     14       24       20       32       10       Bonus 1    100
  Time       9 min    16 min   25 min   35 min   25 min   10 min     120 min

The questions are not ordered by difficulty. Please scan all questions before starting to answer them. Don't waste too much time on any single question.

Please write your answers clearly in the designated boxes. You may lose points if the instructor cannot recognize your handwriting.

Question 1 (14 points): Multiple choice

For each question, there will be exactly one correct statement. You will get 1 point if your answer is correct or 0 points if your answer is wrong. You must write your answer in the box next to each question. Your answer must be one of the following: A, B, C, or D.
1-1. Which of the following is NOT one of the four properties of autonomic computing defined by IBM?
     A. Self-configuration   B. Self-healing   C. Self-programming   D. Self-protection
     Answer:

1-2. In Scala, which of the following types is actually a Java type?
     A. List   B. Map   C. String   D. Tuple
     Answer:

1-3. Which of the following is NOT allowed in Scala?
     A. Define a function that takes another function as a parameter.
     B. Define a function that returns another function as its return value.
     C. Define a function inside another function.
     D. Define a function and a variable with the same name in a class.
     Answer:

1-4. Suppose s is a String in Scala. Which of the following expressions does NOT give its first character?
     A. s(0)   B. s[0]   C. s.apply(0)   D. s.charAt(0)
     Answer:

1-5. Which of the following statements about RDDs is NOT correct?
     A. RDD stands for Responsive Distributed Dataset.
     B. An RDD is a collection of elements partitioned across the cluster.
     C. An RDD can be operated on in parallel.
     D. RDDs automatically recover from node failures.
     Answer:

1-6. Which of the following RDD operations is lazily evaluated?
     A. countByKey   B. distinct   C. fold   D. foreach
     Answer:
1-7. Which of the following RDD operations cannot introduce a new stage?
     A. join   B. keyBy   C. reduceByKey   D. repartition
     Answer:

1-8. Which kind of task has the lowest priority of being assigned to an executor?
     A. Speculative task   B. Process-local task   C. Node-local task   D. Non-local task
     Answer:

1-9. Which of the following RDD operations will produce a result with no partitioner, even if the parent RDD has a partitioner?
     A. cogroup   B. filter   C. flatMap   D. mapValues
     Answer:

1-10. Which of the following RDD operations will benefit from partitioning?
     A. filter   B. flatMapValues   C. lookup   D. union
     Answer:

1-11. Which of the following sequences of RDD operations is NOT possible in Spark?
     A. filter → keyBy → reduceByKey → mapValues
     B. flatMap → distinct → collect
     C. map → reduce → takeSample
     D. join → values → groupByKey → flatMapValues
     Answer:

1-12. Which of the following chains of RDD operations yields a range-partitioned RDD?
     A. sortByKey → map → aggregateByKey
     B. map → sortByKey → filter
     C. filter → flatMap → groupByKey
     D. flatMap → join → foldByKey
     Answer:
1-13. Some kinds of dependencies, such as join, are sometimes wide and sometimes narrow. Which of the following statements is correct about joining two RDDs?
     A. Join is narrow when executed lazily, and wide when forced to materialize by an action.
     B. Join is narrow when its two input RDDs resulted from narrow dependencies, and wide otherwise.
     C. Join is narrow when the preceding RDDs are partitioned by key in the same way and the join is joining on that key, and wide otherwise.
     D. Join is narrow when the preceding RDDs are persistent, and wide otherwise.
     Answer:

1-14. Which of the following statements is correct about RDD dependencies?
     A. A narrow dependency can be executed lazily, while a wide dependency must be executed immediately.
     B. A narrow dependency requires input RDD data to be key-value pairs, while a wide dependency's input can be arbitrary records.
     C. A narrow dependency is likely to execute more slowly than a wide dependency.
     D. If a worker fails, narrow RDD partitions for which it was responsible may have to be recomputed starting with the original inputs.
     Answer:

Question 2 (24 points): True or false

For each statement, decide whether it is true or false. You will get 1 point if your answer is correct or 0 points if your answer is wrong. You can also answer "I don't know" to get 0.5 points, so you don't need to try your luck.

You must write your answer in the designated cell to the right of each question. Your answer must be one of the following: True, False, or I don't know.

  Statement                                                                    Answer
2-1. A pure function cannot print anything on the screen.
2-2. Scala has only one primitive type — Unit.
2-3. A Scala tuple can contain any number of elements of any type of data.
2-4. A Scala Vector is represented as a linked list.
2-5. Suppose you have defined "var a = 0; val b = 0;" in Scala. Then, you can do "a++" but not "b++".
2-6. Anything that can be done in MapReduce can also be done in Spark.
2-7. In Spark, any operation that works on an RDD must also work on a Pair RDD.
2-8. Executors are long-running processes shared across multiple Spark applications.
2-9. Executors send status updates to the driver during the execution of a Spark job.
2-10. An executor runs tasks in its own JVM rather than creating new JVMs.
2-11. In Spark, a shuffle operation redistributes data so that it's grouped differently across partitions.
2-12. Spark automatically caches intermediate data in shuffle operations.
2-13. After an action is called, the Spark driver first runs the task scheduler and then the DAG scheduler.
2-14. In a Spark job, each transformation belongs to exactly one stage.
2-15. Spark cannot run an iterative algorithm in a single job because a transformation cannot be chained after an action.
2-16. Suppose a is of type RDD[(Int, Int)]. Calling "a.sortByKey().first()" has the same effect as calling "a.takeOrdered(1)".
2-17. A narrow dependency means each partition of the child RDD uses data from no more than one partition of the parent RDD.
2-18. Invoking the default cache() on an RDD avoids having to recompute the RDD using the lineage graph from its inputs when a node fails.
2-19. The Spark driver can send a broadcast variable to all executors, and the executors can update the value of the broadcast variable and send it back to the driver.
2-20. An accumulator in the map() function is not executed until an action is called.
2-21. An executor can access the current result of an accumulator at any time.
2-22. In Spark, the data of each RDD is stored on a single node, and different RDDs may be stored on different nodes.
2-23. The mapPartitions() operation can be used to implement the filter() operation.
2-24. The reduceByKey() operation may produce different values if an RDD is repartitioned.
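As an aside on statement 2-23, a minimal sketch of filter expressed through mapPartitions, with a hypothetical nums RDD (illustrative, not part of the exam):

    // Apply the predicate to each partition's iterator;
    // the result matches nums.filter(_ % 2 == 0).
    val nums = sc.parallelize(1 to 10)
    val evens = nums.mapPartitions(iter => iter.filter(_ % 2 == 0))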
Question 3 (20 points): Short answers

Write your answer in the box under each question. Each question is worth 4 points.

3-1. In Spark, why are shuffle operations more expensive than non-shuffle operations?
3-2. Consider the deployment of a Spark application. Where does the driver program run in client mode versus cluster mode? Which mode is required for an interactive Spark program, and why?

3-3. Suppose "rdd" is an RDD of integer numbers:

    val rdd = sc.parallelize(Array(1, 2, 3, 4))

Your friend Alice wants to calculate the sum of all elements in this RDD, so she writes the following program:

    1 var sum = 0
    2 rdd.foreach(x => sum += x)
    3 println(sum)

What will Alice's program print out? Why?
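For contrast with 3-3, a minimal sketch of two approaches that do deliver the sum to the driver, assuming the same sc and rdd (illustrative, not part of the exam):

    // Option 1: reduce is an action, so the result is returned to the driver.
    val total = rdd.reduce(_ + _)

    // Option 2: an accumulator; Spark merges executor updates back into the driver.
    val acc = sc.longAccumulator("sum")
    rdd.foreach(x => acc.add(x))
    println(acc.value)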
3-4. Suppose "netids" is an RDD of NetIDs read from HDFS, and "students" is a Scala Map containing the mapping from NetIDs to full names:

    1 val netids = sc.textFile(…)
    2 val students = Map("ax123" -> "Alice Anderson", "by456" -> "Bob Brown", "cz789" -> "Carol Clark", …)

Your friend Bob wants to augment the RDD with full names, so he writes the following program:

    val result = netids.map(netid => (netid, students(netid)))

Suppose "students" is a large Map that takes up 300 MB of memory. What's the downside of Bob's program? How can you improve it?

3-5. When you call saveAsTextFile() to save an RDD as a text file, sometimes more than one file is written to the file system. Why does that happen?
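A broadcast variable is one common remedy for the situation in 3-4; a minimal sketch, assuming the same netids and students (illustrative, not part of the exam):

    // Ship the 300 MB Map to each executor once, instead of serializing it
    // into every task's closure.
    val studentsBC = sc.broadcast(students)
    val result = netids.map(netid => (netid, studentsBC.value(netid)))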
Question 4 (32 points): PageRank

The PageRank algorithm iteratively updates a rank for each document by adding up contributions from documents that link to it. On each iteration, each document sends a contribution of r/n to its neighbors, where r is its rank and n is its number of neighbors. It then updates its rank to α/N + (1 − α) Σᵢ cᵢ, where the sum is over the contributions cᵢ it received and N is the total number of documents.

Suppose this algorithm is run for three iterations. Here is its implementation in Spark:

    01 val links = sc.textFile(…).map(…) // RDD of (URL, outlinks) pairs
    02 var ranks = …                     // RDD of (URL, rank) pairs
    03 var iteration = 1
    04 while (iteration <= ITERATIONS) {
    05   // Build an RDD of (targetURL, float) pairs with the contributions sent by each page
    06   val contribs = links.join(ranks).flatMap {
    07     case (url, (links, rank)) =>
    08       links.map(dest => (dest, rank / links.size))
    09   }
    10   // Sum contributions by URL and get new ranks
    11   ranks = contribs.reduceByKey((x, y) => x + y)
    12     .mapValues(sum => a/N + (1-a)*sum)
    13   iteration += 1
    14 }

Based on the above information, answer the following questions. (Note: these questions are not about the PageRank algorithm itself and are not ordered by difficulty.)

Write your answer in the box under each question. Each question is worth 4 points.
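For concreteness, the elided initialization on Line 02 is typically a uniform starting rank per URL; a hypothetical sketch, not part of the exam code:

    // Hypothetical: derive the initial (URL, rank) pairs from links,
    // giving every page the same starting rank.
    var ranks = links.mapValues(_ => 1.0)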
4-1. Draw the lineage graph for all RDDs in the above code. (For Question 4-1 only, suppose ITERATIONS = 3.)

4-2. What's the main reason that Spark can run the PageRank algorithm more efficiently than Hadoop MapReduce?
4-3. Your friend Carol is using the above code to compute PageRank on a huge dataset of web links. Her job takes hours to run, and she wants to find out which parts of the job take the most time. She inserts code that gets the wall-clock time before and after each statement of the above code, and prints the difference (i.e., prints how long that statement took to execute). She sees that the statements on Lines 6-9 and Lines 11-12 each take only a fraction of a second to execute, and that the entire "while" loop from Line 4 to Line 14 takes less than a second, even though the whole Spark job takes hours.

Why does the "while" loop take so much less time than the whole Spark job?

4-4. You want to help Carol improve the performance of her code. Suppose you can add a persist() operation to one of the RDDs. Where should you put it to avoid the most computation? Write down the line number and how you would change that line of code, and explain why.
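As a point of reference for 4-4, one commonly cited placement is on the RDD that every iteration reuses; a sketch mirroring Line 01 (one possibility, not an official answer key):

    01 val links = sc.textFile(…).map(…).persist() // reused by the join on every iteration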
4-5. To further improve the performance, suppose you can add a partitionBy() operation to one of the RDDs. Which RDD should you apply it to, and why would it improve the performance of this algorithm?

4-6. Rewrite the imperative while loop on Lines 3, 4, and 13 in the functional programming paradigm so that no var is used.
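One shape such a rewrite can take is a fold over the iteration indices. A sketch assuming links, a, N, and ITERATIONS as in the exam code, with initialRanks as a hypothetical stand-in for the Line 02 ranks:

    // Each pass of foldLeft threads the current ranks RDD through one iteration.
    val finalRanks = (1 to ITERATIONS).foldLeft(initialRanks) { (ranks, _) =>
      links.join(ranks)
        .flatMap { case (url, (links, rank)) =>
          links.map(dest => (dest, rank / links.size))
        }
        .reduceByKey((x, y) => x + y)
        .mapValues(sum => a/N + (1-a)*sum)
    }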
4-7. Rewrite Line 12 using map instead of mapValues.

4-8. Would there be any disadvantage if Line 12 were implemented using map instead of mapValues? Explain why.
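For reference on 4-7, the map form of Line 12 would look something like this sketch:

    12     .map { case (url, sum) => (url, a/N + (1-a)*sum) }

Unlike mapValues, a generic map gives Spark no guarantee that the keys are unchanged, so the parent's partitioner cannot be carried over; that trade-off is what 4-8 probes.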
Question 5 (10 points): Online shopping analytics

Suppose you are running a large online shopping website, and you have two datasets: users and transactions. Both are text files, where each line represents a record. Fields are separated by a comma (","), and all fields contain alphanumerical characters only.

The users dataset (users.txt) consists of two columns: user name and location. You can assume that all user names are unique. Here is a snippet of users.txt:

    Alice,CA
    Bob,NY
    Carol,FL
    Dave,NY

The transactions dataset (transactions.txt) consists of three columns: transaction ID, user name, and product name. You can assume that all product names are unique, and each transaction contains exactly one product. Here is a snippet of transactions.txt:

    1,Alice,Xbox
    2,Bob,LearningSpark
    3,Alice,LearningSpark
    4,Bob,PlayStation
    5,Carol,Switch
    6,Dave,LearningSpark
    7,Alice,Switch
    8,Dave,Switch

Your job is to find the number of unique locations where each product has been sold and save the output in the "shopping" directory on HDFS.

In your output file(s), each line should represent a product, where the first field should be the product name, and the second field should be the number of unique locations where this product has been sold. Fields should be separated by a comma (","). The ordering of products is not important.

For example, the above input data would result in the following output file(s):

    Switch,3
    PlayStation,1
    Xbox,1
    LearningSpark,2
Write a Spark program to solve this problem in the box below. You need to provide the Scala code for loading the data, computing the unique-location counts, and saving the results. Comments are not required but would be helpful in getting partial credit.

Note: The grading is not all-or-nothing. Show your effort to get partial credit.
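One possible solution, as a sketch (file paths and intermediate names are illustrative; it assumes the schemas shown in the snippets above):

    // Load users as (userName, location) pairs.
    val users = sc.textFile("users.txt")
      .map { line => val f = line.split(","); (f(0), f(1)) }

    // Load transactions as (userName, productName) pairs; the transaction ID is unused.
    val transactions = sc.textFile("transactions.txt")
      .map { line => val f = line.split(","); (f(1), f(2)) }

    // Attach each purchase to the buyer's location, keep distinct
    // (product, location) pairs, and count locations per product.
    val counts = transactions.join(users)        // (user, (product, location))
      .map { case (_, (product, location)) => (product, location) }
      .distinct()
      .mapValues(_ => 1)
      .reduceByKey((x, y) => x + y)

    // Emit one "product,count" line per product into the "shopping" directory.
    counts.map { case (product, count) => product + "," + count }
      .saveAsTextFile("shopping")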
Bonus (1 extra point)

In your opinion, how could this course be improved?

You will get one extra point as long as you provide any non-empty feedback. However, your total score on this exam will not exceed 100 points.
Please detach this page and use it as scratch paper. You do not need to turn in this page.