What is the RDD?

Let's talk about the RDD in reverse order, because I'm weird like that. Fundamentally, the RDD is a Dataset: it's an abstraction for a giant set of data, and that's the main thing you need to know as a developer. What you'll do is set up RDD objects, load them up with big datasets, and then call various methods on those RDD objects to distribute the processing of that data.

Now, the beauty is that although RDDs are both Distributed and Resilient, you don't really have to worry about those things. RDDs can be spread out across an entire cluster of computers that may or may not be running locally, and they can automatically handle the failure of specific executor nodes in your cluster: they keep going even if one node shuts down, and they redistribute the work as needed when that occurs. You don't have to think about any of this, though; that's what Spark and your cluster manager do for you. So even though being distributed and resilient makes the RDD a very powerful thing, you don't have to worry about the specifics of how that works, because it's kind of magic.

All you really need to know as a developer is that an RDD represents a really big dataset, and that you can use the RDD object to transform that dataset from one set of data to another, or to perform actions on that dataset to actually get the results you want from it.
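To make that concrete, here's a minimal sketch in PySpark (assuming a local Spark installation; the sample numbers and app name are just made up for illustration). It creates an RDD, transforms it into a new dataset, and then runs an action to actually get a result back:

```python
from pyspark import SparkConf, SparkContext

# The SparkContext is your entry point for creating RDDs.
# "local" just means run on this machine; it could point at a real cluster.
conf = SparkConf().setMaster("local").setAppName("RDDExample")
sc = SparkContext(conf=conf)

# Create an RDD. Here it's a tiny in-memory list, but it could just as
# easily be a huge file or a distributed data source.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformation: describes how to turn one dataset into another (lazy).
squared = numbers.map(lambda x: x * x)

# Action: kicks off the distributed computation and returns the results.
print(squared.collect())  # [1, 4, 9, 16, 25]

sc.stop()
```

Notice you never say which node processes which chunk of data or what happens if one of them dies; Spark and the cluster manager handle that behind the scenes.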