- Frank Kane's Taming Big Data with Apache Spark and Python
- Frank Kane
Mapping the values of a key/value RDD
Now this is a very important point with key/value RDDs: if your transformation is not going to modify the keys, make sure you call mapValues() or flatMapValues() instead of plain map() or flatMap(). This matters because it's more efficient. Getting a little technical, it allows Spark to maintain the partitioning of your original RDD instead of having to shuffle the data around, and shuffling can be very expensive when you're running on a cluster.
So anytime you're calling map() or flatMap() on a key/value RDD, ask yourself: am I actually modifying the keys? If the answer is no, you should be calling mapValues() or flatMapValues() instead.
Just to review: mapValues() has a one-to-one relationship, so every element in your original RDD is transformed into exactly one new element, using whatever function you define. flatMapValues(), on the other hand, can blow that out into multiple elements per original element, so you can end up with a new RDD that is longer, or contains more values, than the original one.

One thing to keep in mind: when you call mapValues() or flatMapValues(), the only thing passed into your transformation function is the value itself. Don't take that to mean the keys are being discarded; they're not. They're just not being modified, and not being exposed to you. So even though your function only receives the value of each key/value pair, the key is still there, still present, you're just not allowed to touch it. I realize this is a lot to digest; it will make a lot more sense when we look at a real example, so bear with me.
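The one-to-one versus one-to-many distinction can be shown without a cluster at all. Here is a pure-Python sketch of the semantics (the helper functions are illustrative stand-ins, not the real Spark implementation): mapValues() emits one pair per input pair, while flatMapValues() emits one pair per element your function returns, always carrying the untouched key along:

```python
def map_values(pairs, f):
    # One output pair per input pair: f sees only the value, the key rides along.
    return [(k, f(v)) for k, v in pairs]

def flat_map_values(pairs, f):
    # Zero or more output pairs per input pair: f returns an iterable,
    # and the key is duplicated onto each element it produces.
    return [(k, out) for k, v in pairs for out in f(v)]

pairs = [("a", "x y"), ("b", "z")]

print(map_values(pairs, str.upper))
# [('a', 'X Y'), ('b', 'Z')]      -- same length as the input

print(flat_map_values(pairs, str.split))
# [('a', 'x'), ('a', 'y'), ('b', 'z')]  -- longer than the input
```

Notice that in both cases the keys come through unchanged; your function never even sees them, which is precisely why Spark can trust that the partitioning is still valid.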