Apache Spark with Python (1): Installation

Jan 17, 2020

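Environment: Windows 10

Spark is built on Scala, and Scala runs on Java, so Java has to be installed first.

Note: these steps are my installation notes from the course https://www.udemy.com/course/taming-big-data-with-apache-spark-hands-on/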

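1. Install the Java JDK first: jdk-8u241-windows-x64 (do not pick a newer version).
Note: because of some Spark bugs, the install directory should not sit inside other nested folders. Change the default directory; installing directly to C:\jdk is recommended.
https://www.oracle.com/technetwork/java/javase/downloads/index.html
https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html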
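Before moving on, it can help to confirm which JDK the system actually picks up. The following is a minimal sketch, not part of the course material: the helper names parse_java_version, is_java_8 and installed_java_version are hypothetical, and it assumes the usual `java -version` banner, which contains a quoted version string such as "1.8.0_241".

```python
import re
import subprocess


def parse_java_version(banner: str) -> str:
    """Extract the quoted version string from `java -version` output."""
    match = re.search(r'version "([^"]+)"', banner)
    if match is None:
        raise ValueError("no version string found in: %r" % banner)
    return match.group(1)


def is_java_8(version: str) -> bool:
    """Java 8 reports itself as 1.8.x (for example 1.8.0_241)."""
    return version.startswith("1.8.")


def installed_java_version() -> str:
    """Run `java -version`; the JVM prints its banner on stderr."""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return parse_java_version(result.stderr or result.stdout)
```

Calling is_java_8(installed_java_version()) in a fresh terminal should return True once JDK 8 is the one on PATH.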
2. Download Spark from https://spark.apache.org/downloads.html

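3. After downloading, extract the archive with 7-Zip and place its contents under C:\spark.
Go into the conf folder and remove the .template extension from log4j.properties.template, so that the file is named log4j.properties.
Open it in a text editor, find the setting under the comment
# Set everything to be logged to the console
and change it to:
log4j.rootCategory=ERROR, console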
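The same edit can be scripted. This is only a sketch under a few assumptions: the function name quiet_spark_logging is hypothetical, the template is assumed to contain a line starting with log4j.rootCategory=, and the script copies the template instead of renaming it so the original stays available.

```python
from pathlib import Path


def quiet_spark_logging(conf_dir: str) -> Path:
    """Write log4j.properties from the template with ERROR-level console logging."""
    conf = Path(conf_dir)
    template = conf / "log4j.properties.template"
    target = conf / "log4j.properties"
    lines = []
    for line in template.read_text(encoding="utf-8").splitlines():
        # Only the root logger line changes; every other setting is kept.
        if line.startswith("log4j.rootCategory="):
            line = "log4j.rootCategory=ERROR, console"
        lines.append(line)
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target
```

With the layout used here, the call would be quiet_spark_logging(r"C:\spark\conf").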

4. Download http://media.sundog-soft.com/Udemy/winutils.exe
and put it directly in C:\winutils\bin.

Then create the folder C:\tmp\hive
and run the following in cmd:
C:\winutils\bin>winutils.exe chmod 777 \tmp\hive
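For repeat installs, the folder creation and the chmod call can be wrapped in a few lines of Python. This is a rough sketch that only makes sense on Windows; the helper names chmod_command and prepare_hive_dir are hypothetical, and it assumes winutils.exe accepts an absolute target path in the same way as the drive-relative \tmp\hive used above.

```python
import os
import subprocess


def chmod_command(hadoop_home: str, target: str) -> list:
    """Build the argument list for `winutils.exe chmod 777 <target>`."""
    winutils = hadoop_home.rstrip("\\/") + "\\bin\\winutils.exe"
    return [winutils, "chmod", "777", target]


def prepare_hive_dir(hadoop_home: str = r"C:\winutils",
                     target: str = r"C:\tmp\hive") -> None:
    """Create the scratch folder, then open up its permissions (Windows only)."""
    os.makedirs(target, exist_ok=True)
    # check=True raises CalledProcessError if winutils exits with an error.
    subprocess.run(chmod_command(hadoop_home, target), check=True)
```

Passing the command as a list avoids shell quoting problems if the paths ever change.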

5. Set the environment variables
Add JAVA_HOME = C:\jdk
Add SPARK_HOME = C:\spark
Add HADOOP_HOME = C:\winutils
Add the following to the PATH variable:
%JAVA_HOME%\bin
%SPARK_HOME%\bin
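Since a missing or mistyped variable is an easy mistake at this step, a small check can save some guessing. A minimal sketch, assuming the three variable names above; the helpers missing_vars and bin_dirs_missing_from_path are hypothetical, and the PATH separator is fixed to ";" because the target is Windows.

```python
import os

REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")


def missing_vars(env) -> list:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def bin_dirs_missing_from_path(env, sep: str = ";") -> list:
    """Return the JAVA_HOME and SPARK_HOME bin folders that PATH does not list."""
    entries = {
        entry.strip().rstrip("\\/").lower()
        for entry in env.get("PATH", "").split(sep)
    }
    missing = []
    for name in ("JAVA_HOME", "SPARK_HOME"):
        home = env.get(name)
        if home:
            bin_dir = home.rstrip("\\/") + "\\bin"
            # Windows paths are case-insensitive, so compare in lower case.
            if bin_dir.lower() not in entries:
                missing.append(bin_dir)
    return missing
```

Run missing_vars(os.environ) and bin_dirs_missing_from_path(os.environ) from a newly opened cmd window, since windows that were already open do not see the new variables.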

Open cmd and type pyspark. If the Spark logo appears, the installation worked.
You can try entering the following commands:

>>> rdd = sc.textFile(r"C:\spark\README.md")
>>> rdd.count()
105
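
For a text file, each line becomes one record in the RDD, so count() is the number of lines in README.md. The exact number depends on the README shipped with your Spark version. A plain Python cross-check, as a sketch with a hypothetical helper name count_lines:

```python
def count_lines(path: str) -> int:
    """Count the lines of a text file without Spark."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)
```

Calling count_lines(r"C:\spark\README.md") should give the same number as rdd.count().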

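Thoughts: the Spark environment is very hard to set up. I often spent ages on it only to have something fail somewhere, so all I can say is that versions matter a lot. Spark 3 is already out, but I don't quite dare to upgrade yet; getting a stable setup comes first.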

Posts in this series:
Apache Spark with Python (1): Installation
Apache Spark with Python (2): Concepts
Apache Spark with Python (3): Hands-on
Apache Spark with Python (4): Processing a movie dataset
Apache Spark with Python (5): Collaborative filtering, a movie recommender system in practice


Written by Jimmy Huang
