Foro Formación Hadoop
Using SparkSQL with Hive
Using SparkSQL
This section shows you how to enable SparkSQL on your Altiscale Spark cluster. A few tweaks (most of which are already performed by Altiscale) are required for SparkSQL to interact with the Hive metastore.
SparkSQL runs like any other Spark job; it inherits all Spark job characteristics, including event log directories, JVM settings, and the rest of the configuration defined in /etc/spark/.
The following sections provide a few test cases showing how to invoke SparkSQL commands and the SQLContext and HiveContext APIs. Please note you will need to include a pointer to the Hadoop LZO jar in your command-line invocation, as shown in the examples.
All examples are available under /opt/spark/test_spark/ in the Altiscale distribution.
Configuration
First, you will need to enable SparkSQL on your cluster. Enabling it is currently a bit manual; these manual steps will be automated soon.
If you haven't installed Spark on your Workbench, you should follow the instructions at Installing Spark 1.3 or Installing Spark 1.4. If you have an older version of Spark, you will need to update /etc/spark/spark-env.sh.
- Copy /etc/hive/hive-site.xml to /etc/spark/. This will be the global Hive configuration applied in Spark. If you need to customize your own Spark configuration, you may create your own configuration directory and point Spark to it when you submit your own Spark jobs (see the sketch after this step).
cd /etc/spark/
cp /etc/hive/hive-site.xml .
# Strip out any passwords or credentials here if applicable; we don't need them.
vim hive-site.xml
chown spark:spark hive-site.xml
chmod 555 hive-site.xml
# You should see the following permission in /etc/spark/
ls -al hive-site.xml
-r-xr-xr-x 1 spark spark 2053 Feb 12 02:56 hive-site.xml
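If you do want a per-user configuration directory instead of the global /etc/spark/, a minimal sketch is shown below (the directory name is just an example; SPARK_CONF_DIR is honored by the Spark launch scripts such as spark-submit, spark-sql, and spark-shell):
mkdir -p ~/my-spark-conf                  # hypothetical location for a private configuration
cp /etc/spark/* ~/my-spark-conf/          # start from the cluster defaults
chmod -R u+w ~/my-spark-conf              # the copied hive-site.xml is read-only (555)
# Edit ~/my-spark-conf/hive-site.xml as needed, then point Spark at the directory:
export SPARK_CONF_DIR=~/my-spark-conf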
- Remove the four javax.jdo.option.Connection<xxx> properties from hive-site.xml if they appear in your hive-site.xml file. If you originally copied your /etc/spark/hive-site.xml from /etc/hive-metastore/hive-site.xml (as instructed in previous versions of this document), then these properties will be present and should be removed. If you copied from /etc/hive/hive-site.xml, as specified in step 1 above, these properties will not be present.
(WARNING: The values shown here are examples; your hive-site.xml may vary.)
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>**************</value>
</property>
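A quick way to confirm the four properties are really gone (just a convenience check; adjust the path if you use a custom configuration directory):
# Should print nothing once the javax.jdo.option.Connection<xxx> properties have been removed
grep -n 'javax.jdo.option.Connection' /etc/spark/hive-site.xml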
- Make sure the hive.server2.authentication property value is set to NONE.
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
</property>
- Remove or comment out all local logging properties that point to /home, such as scratch directory settings that will not work on remote containers (e.g. in yarn-cluster mode).
For example, you may see the following setting pointing to the /home directory, which remote containers are not allowed to access.
<property>
<name>hive.exec.local.scratchdir</name>
<value>/home/hive/scratch/${user.name}</value>
</property>
You should remove it (the default points to /tmp) or update it to point to a temporary directory (e.g. /tmp/) that is writable on all remote containers.
<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp/hive/scratch/${user.name}</value>
</property>
- Make sure the HiveServer2 (or Hive ThriftServer2) Thrift properties and doAs are present. (If you completed steps 1 to 3, the values are already provided and nothing needs to change.) For example:
<property>
<name>hive.metastore.uris</name>
<value>thrift://hiveserver-<customername>.s3s.altiscale.com:9083</value>
</property>
<property>
<name>hive.metastore.sasl.enabled</name>
<value>false</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
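As an optional sanity check, you can verify from the Workbench that the metastore Thrift endpoint listed in hive.metastore.uris is reachable (the hostname below is the placeholder from the example above; substitute your own value):
# Exits successfully if the metastore port answers within 5 seconds
nc -z -w 5 hiveserver-<customername>.s3s.altiscale.com 9083 && echo "metastore reachable"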
Tez Enabled Cluster
This is only applicable if the Tez engine is the global default engine for Hive on your cluster. In that case, you will need to make the following update to your configuration.
These settings are required only because the Spark ThriftServer reuses the HiveServer2 code base, which needs them when it interacts with Hive.
You may want to create a script to do this. You can copy and paste the following into a Bash script to launch your own Spark ThriftServer2.
cd $SPARK_HOME/sbin
# Example listening port; pick any free port (20000 matches the SSH tunnel example later in this document)
spark_ts2_listen_port=${spark_ts2_listen_port:-20000}
hive_tez_jars=""
if [ -f /etc/tez/tez-site.xml ] ; then
  hive_tez_jars=$(find /opt/tez/ -type f -name "*.jar" | tr -s "\n" "," | sed 's/,$//')
fi
hadoop_snappy_jar=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name "snappy-java-*.jar")
hadoop_lzo_jar=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name "hadoop-lzo-*.jar")
spark_opts_extra="$spark_opts_extra --jars $hadoop_lzo_jar,$hadoop_snappy_jar,$hive_tez_jars"
spark_event_log_dir=$(grep 'spark.eventLog.dir' /etc/spark/spark-defaults.conf | tr -s ' ' '\t' | cut -f2)
./start-thriftserver.sh --hiveconf hive.server2.thrift.port=$spark_ts2_listen_port --hiveconf hive.server2.thrift.bind.host=$(hostname) \
  --master yarn-client --executor-memory 1G --num-executors 4 --executor-cores 2 --driver-memory 1G \
  --conf spark.locality.wait=10000 --conf spark.shuffle.manager=sort --conf spark.eventLog.dir=${spark_event_log_dir}/$USER/ $spark_opts_extra
The bottom line is that the hive_tez_jars list needs to be included in your classpath when you invoke Spark ThriftServer2.
You will also need to update /etc/spark/hive-site.xml to include /etc/tez/tez-site.xml via XInclude:
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="/etc/tez/tez-site.xml">
<xi:fallback>
</xi:fallback>
</xi:include>
Examples
The following examples run from your Workbench. Be aware that the --driver-memory option consumes memory on the Workbench where you run these commands and may interfere with other processes there.
spark-sql
Try spark-sql. spark-sql commands run like a Spark job, so only local mode (not recommended) and yarn-client mode are supported. Here, we prefer yarn-client mode.
cd $SPARK_HOME/bin/
hive_tez_jars=""
if [ -f /etc/tez/tez-site.xml ] ; then
  hive_tez_jars=$(find /opt/tez/ -type f -name "*.jar" | tr -s "\n" "," | sed 's/,$//')
fi
hadoop_snappy_jar=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name "snappy-java-*.jar")
hadoop_lzo_jar=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name "hadoop-lzo-*.jar")
spark_opts_extra="$spark_opts_extra --jars $hadoop_lzo_jar,$hadoop_snappy_jar,$hive_tez_jars"
spark_event_log_dir=$(grep 'spark.eventLog.dir' /etc/spark/spark-defaults.conf | tr -s ' ' '\t' | cut -f2)
./spark-sql --driver-memory 1G --master yarn --deploy-mode client --executor-cores 4 --num-executors 8 --executor-memory 2G --conf spark.eventLog.dir=${spark_event_log_dir}/$USER/ $spark_opts_extra
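As a quick smoke test, the same invocation can also run a single statement non-interactively via the -e option (the query below is only an example; any statement against your metastore will do):
./spark-sql --driver-memory 1G --master yarn --deploy-mode client --conf spark.eventLog.dir=${spark_event_log_dir}/$USER/ $spark_opts_extra -e "SHOW DATABASES;"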
Known Issue:
https://issues.apache.org/jira/browse/SPARK-6014
SQLContext and HiveContext API
Run the test cases test_spark_sql.sh and test_spark_hql.sh in /opt/spark/test_spark/ and you should see the results. If Kerberos is enabled, you will need to run test_spark_hql.kerberos.sh instead.
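If you prefer to experiment with the HiveContext API interactively rather than through the packaged scripts, here is a minimal sketch (it assumes the hadoop_lzo_jar and hadoop_snappy_jar variables from the earlier examples, and Spark 1.3/1.4 where HiveContext is the entry point for Hive tables):
cd $SPARK_HOME/bin/
./spark-shell --master yarn-client --jars $hadoop_lzo_jar,$hadoop_snappy_jar <<'EOF'
// HiveContext picks up /etc/spark/hive-site.xml and talks to the Hive metastore
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("SHOW TABLES").collect().foreach(println)
EOF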
Spark ThriftServer2 (Spark TS2)
You only need to run this if you are trying to connect a BI application (e.g. Tableau) that uses the SparkSQL engine for your SQL queries. Spark ThriftServer2 is like another Spark job; it inherits a Spark job's configuration.
For example, a complicated command with a lot of tuning (which is required) may look as shown below.
The $(hostname) value is required as a parameter for the SSH tunneling afterwards, since the server will only bind to a certain network interface on your Workbench.
sudo su - spark
cd /opt/spark/sbin/
./start-thriftserver.sh --hiveconf hive.server2.thrift.port=20000 --hiveconf hive.server2.thrift.bind.host=$(hostname) --master yarn-client --queue production --executor-memory 4G --num-executors 8 --executor-cores 4 --driver-memory 1G --conf spark.kryoserializer.buffer.mb=128 --conf spark.locality.wait=10000 --conf spark.shuffle.manager=sort --conf spark.shuffle.consolidateFiles=true --conf spark.rdd.compress=true --conf spark.storage.memoryFraction=0.6 --conf spark.sql.inMemoryColumnarStorage.compressed=true --conf spark.sql.inMemoryColumnarStorage.batchSize=1012000
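Before setting up the SSH tunnel, you can confirm that the Thrift server is actually listening on the port passed via --hiveconf (20000 in this example):
# Should show a LISTEN entry for port 20000 once the server is up
netstat -tln | grep 20000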
The following shows how to tunnel to the Spark ThriftServer2 that is listening on $(hostname) on your Workbench.
The <HOSTNAME> is the hostname value from the hostname command on the system where you started the Spark ThriftServer2.
ssh -L 20000:<HOSTNAME>:20000 workbench
Now you can access Spark ThriftServer2 from your BI application using localhost:20000. If you are using an ODBC driver, the port number is now 20000 instead of the default HiveServer2 port, 10000.
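If you want to verify the tunnel before pointing a BI tool at it, beeline (shipped with Hive) can connect through the same local port (the query is only a placeholder):
beeline -u jdbc:hive2://localhost:20000 -n $USER -e "SHOW TABLES;"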
Source: http://documentation.altiscale.com/spark-sql-how-to