- Frank Kane's Taming Big Data with Apache Spark and Python
- Frank Kane
Setting up the SparkContext object
Next, we actually create our SparkContext, and this is going to look very similar in every Spark script that we write in Python. We have these two lines:
conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)
So let's look at what's going on here. In the first line, we take the SparkConf object that we imported earlier and tell it to set its master node as the local machine. Basically, it says we're going to run on the local box only, not on a cluster, just on this one system. There are extensions to "local" that tell Spark to split the work up among the multiple CPU cores you might have on your machine, but this is a very simple example, so we're just going to run it in a single thread on a single process, which is what "local" means. So we're not really doing any sort of distribution of the data; it's just running in one process to keep things simple for now and get the concepts across. Later on, we'll run more complicated jobs that actually use every core of your computer, and ultimately, we'll run a job on a real cluster using Elastic MapReduce.
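As a rough sketch of what those extensions look like (the "local[4]" and "local[*]" master strings are standard Spark settings, though this particular script sticks with plain "local"), telling Spark to use more of your machine would look something like this:
from pyspark import SparkConf, SparkContext
# "local"    - run in a single thread on a single process (what we use here)
# "local[4]" - run with 4 worker threads, roughly one per CPU core
# "local[*]" - run with as many worker threads as your machine has logical cores
conf = SparkConf().setMaster("local[*]").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)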
Finally, in that first line, we need to set the app name as part of this call by typing .setAppName. We're going to call this app "RatingsHistogram". This is just so that if you look in the Spark Web UI to see what's going on while the job is running, you'll be able to look it up by name and identify it. Now, this job runs too quickly for you to even see it in the UI, so we're not going to worry about that, but it is good practice to give every app a name. In the second line, using that Spark configuration object, we create our SparkContext object and assign it to something called sc:
sc = SparkContext(conf = conf)
By convention, we will always call that sc for SparkContext.
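If you want to confirm that the context picked up those settings, a quick sanity check (not part of the original script) is to print them back out using the SparkContext's appName and master properties:
print(sc.appName)   # prints RatingsHistogram
print(sc.master)    # prints local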