Creating RDDs
One simple example of using sc is the sc.parallelize function, which takes a hardcoded set of data and makes an RDD out of it. But that's not very interesting, and it's not really useful in a real production setting: if you can hardcode the data, it wasn't a big dataset to begin with, now was it? More often, we'll use something like sc.textFile to create an RDD object. So, for example, if I have a giant text file full of, oh I don't know, movie ratings data on my hard drive, we can use sc.textFile to create an RDD object from the SparkContext, and then we can just use that RDD object going forward and process it:
sc.textFile("file:///c:/users/frank/gobs-o-text.txt")
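Putting those two together, here is a minimal sketch of both approaches; it assumes you already have a SparkContext called sc, and the hardcoded numbers are just made-up toy data:

# Hardcoded data -> RDD (fine for experiments, not real big data)
numbers = sc.parallelize([1, 2, 3, 4, 5])
print(numbers.count())

# Text file -> RDD, one element per line of the file
lines = sc.textFile("file:///c:/users/frank/gobs-o-text.txt")
print(lines.count())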
Now, again, if I have a set of information that fits on my computer, that's not really big data either. You can also create a text file RDD from an s3n location or from an HDFS URI. These are both examples of distributed file systems that can handle much larger datasets than we might be able to fit on one machine. You can just as easily use an s3n or an HDFS URI as you can the file URI to load up data from a cluster or from a distributed file system, as well as from a simple file sitting on the same machine as your driver script.
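Just to show the URI formats side by side (the bucket and path names here are made up), those calls look like this:

lines = sc.textFile("s3n://my-bucket/movie-ratings/ratings.csv")
lines = sc.textFile("hdfs:///user/frank/movie-ratings/ratings.csv")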
You can also create RDDs from Hive:
from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age FROM users")
If you have a HiveContext object that's already connected to an existing Hive repository, you can create an RDD from that. If you don't know what Hive is, don't worry about it: Hive is basically another thing that runs on top of Hadoop for data warehousing. You can also create RDDs from sources such as JDBC; you can tie Spark directly to any SQL database that has a JDBC or ODBC interface. You can also use popular NoSQL databases such as Cassandra, and there are interfaces for things such as HBase and Elasticsearch, and a lot of other connectors that are growing all the time. Basically, any data format that you can access from Python or from Java, depending on what language you're using, you can access through Spark as well, so you can load up JSON information and comma-separated value lists. You can also talk to things like sequence files and object files, and load compressed formats directly. So there are a lot of ways to create an RDD; pretty much whatever format your source data might be in, the odds are that you can create an RDD from it in Spark pretty easily.
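As a quick sketch of a couple of those formats (the file names here are hypothetical, and this assumes a Spark version with the DataFrame reader API), the standard PySpark calls look something like this:

# Compressed text is handled transparently - gzip, bzip2, and so on
logLines = sc.textFile("file:///c:/users/frank/logs.txt.gz")

# JSON goes through the SQL layer; .rdd turns the resulting DataFrame into an RDD
jsonRows = hiveCtx.read.json("file:///c:/users/frank/people.json").rdd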