Foro Formación Hadoop

Using SparkSQL against Hive

 
by Admin Formación Hadoop - Tuesday, 6 October 2015, 15:31
 

Using SparkSQL

This section shows how to enable Spark SQL on your Altiscale cluster with Spark. A few tweaks (most of which Altiscale has already performed) are required for SparkSQL to interact with the Hive metastore.

SparkSQL runs like any other Spark job; it inherits all Spark job characteristics, including event log directories, JVM settings, and the rest of the configuration defined in /etc/spark/.

The following sections provide a few test cases showing how to invoke SparkSQL commands and the SQLContext and HiveContext APIs. Please note that you will need to include a pointer to the Hadoop LZO jar in your command-line invocation, as shown in the examples.

All examples are available under /opt/spark/test_spark/ in the Altiscale distribution.

Configuration

First, you will need to enable SparkSQL on your cluster. Enabling SparkSQL is currently a somewhat manual process; these manual steps will be automated in the future.

If you haven't installed Spark on your Workbench, you should follow the instructions at Installing Spark 1.3 or Installing Spark 1.4.  If you have an older version of Spark, you will need to update /etc/spark/spark-env.sh.

  1. Copy /etc/hive/hive-site.xml to /etc/spark/. This will be the global Hive configuration applied to Spark. If you need to customize your own Spark configuration, you can create your own configuration directory and point Spark at it when you submit your Spark jobs (a minimal sketch follows the listing below).

    cd /etc/spark/
    cp /etc/hive/hive-site.xml .
    # Strip off any password or credentials here if applicable, we don't need it.
    vim hive-site.xml
    chown spark:spark hive-site.xml
    chmod 555 hive-site.xml
    # You should see the following permission in /etc/spark/
    ls -al hive-site.xml
    -r-xr-xr-x 1 spark spark 2053 Feb 12 02:56 hive-site.xml
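
If you prefer not to edit /etc/spark/ directly, a minimal sketch of a per-user configuration directory follows (the ~/spark-conf path is only an example, not an Altiscale convention):

mkdir -p ~/spark-conf
cp -r /etc/spark/* ~/spark-conf/
cp /etc/hive/hive-site.xml ~/spark-conf/
# Edit ~/spark-conf/hive-site.xml as needed, then point Spark at the directory:
export SPARK_CONF_DIR=~/spark-conf
# Subsequent spark-submit / spark-sql invocations in this shell will pick up this configuration.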
 
  2. Remove the four javax.jdo.option.Connection<xxx> properties from hive-site.xml if they appear in your file. If you originally copied your /etc/spark/hive-site.xml from /etc/hive-metastore/hive-site.xml (as instructed in previous versions of this document), these properties will be present and should be removed. If you copied from /etc/hive/hive-site.xml, as specified in step 1 above, they will not be present. A quick check is sketched after the property listing below.
    (WARNING: The values shown here are examples; your hive-site.xml may vary.)
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>**************</value>
</property>
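
A quick way to confirm that none of these JDO connection properties remain is to grep for them (this should print the fallback message once they are removed):

grep 'javax.jdo.option.Connection' /etc/spark/hive-site.xml || echo "no JDO connection properties found"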
 
  3. Make sure the hive.server2.authentication property value is set to NONE.

    <property>
      <name>hive.server2.authentication</name>
      <value>NONE</value>
    </property>
     
  4. Remove or comment out all local directory properties that point to /home, such as logging and scratch directory settings, since those paths will not work in remote containers (e.g. in yarn-cluster mode).

For example, you may see the following setting, which points to the /home directory; /home is not accessible from remote containers.

<property> 
  <name>hive.exec.local.scratchdir</name> 
  <value>/home/hive/scratch/${user.name}</value> 
</property>


You should remove it (the default points to /tmp) or update it so that it points to a temporary directory (e.g. under /tmp/) that is writable on all remote containers.

<property> 
  <name>hive.exec.local.scratchdir</name> 
  <value>/tmp/hive/scratch/${user.name}</value> 
</property>
 
  5. Make sure the HiveServer2 (or Hive ThriftServer2) Thrift properties and doAs are present. (If you completed steps 1 through 3, the values are already provided and nothing needs to change.) For example (a connectivity check is sketched after the listing):

<property> 
  <name>hive.metastore.uris</name> 
  <value>thrift://hiveserver-<customername>.s3s.altiscale.com:9083</value> 
</property> 
<property> 
  <name>hive.metastore.sasl.enabled</name> 
  <value>false</value> 
</property> 
<property> 
  <name>hive.server2.authentication</name> 
  <value>NONE</value> 
</property> 
<property> 
  <name>hive.server2.enable.doAs</name> 
  <value>true</value> 
</property>
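
To confirm that the metastore named in hive.metastore.uris is reachable from your Workbench, a simple connectivity check is sketched below (the host is a placeholder; substitute the value from your own hive-site.xml):

# Replace the host with the hive.metastore.uris host from /etc/spark/hive-site.xml.
nc -vz hiveserver-<customername>.s3s.altiscale.com 9083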
 

Tez Enabled Cluster

This is only applicable if the Tez engine is the global default engine for Hive on your cluster.

If Tez is enabled as the default engine on your cluster, you will need to perform the following update to your configuration.

Because the Spark ThriftServer reuses the HiveServer2 code base, your configuration must include the following additions to support Tez, but ONLY if Tez is enabled. These settings are only required when the Spark ThriftServer interacts with Hive.

You may want to create a script for this; you can copy and paste the following into a Bash script to launch your own Spark ThriftServer2.

cd $SPARK_HOME/sbin
hive_tez_jars="" 
if [ -f /etc/tez/tez-site.xml ] ; then 
  hive_tez_jars=$(find /opt/tez/ -type f -name "*.jar" | tr -s "\n" "," | sed 's/,$//') 
fi 

hadoop_snappy_jar=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name "snappy-java-*.jar")
hadoop_lzo_jar=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name "hadoop-lzo-*.jar")
spark_opts_extra="$spark_opts_extra --jars $hadoop_lzo_jar,$hadoop_snappy_jar,$hive_tez_jars"
spark_event_log_dir=$(grep 'spark.eventLog.dir' /etc/spark/spark-defaults.conf | tr -s ' ' '\t' | cut -f2)

# spark_ts2_listen_port is the port Spark TS2 will listen on; 20000 here is just an example value.
spark_ts2_listen_port=20000

./start-thriftserver.sh --hiveconf hive.server2.thrift.port=$spark_ts2_listen_port --hiveconf hive.server2.thrift.bind.host=$(hostname) \
  --master yarn-client --executor-memory 1G --num-executors 4 --executor-cores 2 --driver-memory 1G --conf spark.locality.wait=10000 --conf spark.shuffle.manager=sort --conf spark.eventLog.dir=${spark_event_log_dir}/$USER/ $spark_opts_extra


The bottom line is that the hive_tez_jars list needs to be included in your classpath when you invoke Spark ThriftServer2.
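
To confirm that the snippet above actually found the Tez jars, a quick check (prints one jar path per line) is:

echo "$hive_tez_jars" | tr ',' '\n' | head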

 
ONLY IF Tez is ENABLED, add the following to /etc/spark/hive-site.xml (inside the root <configuration> element) to pull in /etc/tez/tez-site.xml via XInclude. An optional verification is sketched after the snippet.
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="/etc/tez/tez-site.xml"> 
  <xi:fallback> 
  </xi:fallback> 
</xi:include>
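
As an optional verification (assuming xmllint from libxml2 is available on the Workbench), you can expand the XInclude and count how many Tez properties appear in the resolved configuration:

xmllint --xinclude /etc/spark/hive-site.xml | grep -c 'tez'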
 
 

Examples

The following examples are run from your Workbench. Be aware that the --driver-memory option consumes memory on the Workbench where you run these commands and may interfere with other processes there.

spark-sql

Try spark-sql. A spark-sql command runs like a Spark job, so it supports only local mode (not recommended) and yarn-client mode; yarn-client mode is preferred here.

cd $SPARK_HOME/bin/
hadoop_snappy_jar=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name "snappy-java-*.jar")
hadoop_lzo_jar=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name "hadoop-lzo-*.jar")
# hive_tez_jars is only needed on Tez-enabled clusters (see the Tez section above); it can be left empty otherwise.
spark_opts_extra="$spark_opts_extra --jars $hadoop_lzo_jar,$hadoop_snappy_jar${hive_tez_jars:+,$hive_tez_jars}"
spark_event_log_dir=$(grep 'spark.eventLog.dir' /etc/spark/spark-defaults.conf | tr -s ' ' '\t' | cut -f2)


./spark-sql --driver-memory 1G --master yarn --deploy-mode client --executor-cores 4 --num-executors 8 --executor-memory 2G --conf spark.eventLog.dir=${spark_event_log_dir}/$USER/ $spark_opts_extra
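
Once the prompt comes up you can run HiveQL directly. For a quick non-interactive sanity check, spark-sql also accepts -e, in the same way the Hive CLI does (the statement below is just an example):

./spark-sql --master yarn --deploy-mode client --driver-memory 1G --conf spark.eventLog.dir=${spark_event_log_dir}/$USER/ $spark_opts_extra -e "SHOW DATABASES;"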
 

Known Issue:
https://issues.apache.org/jira/browse/SPARK-6014

 

SQLContext and HiveContext API

Run the test cases test_spark_sql.sh and test_spark_hql.sh in /opt/spark/test_spark/ and you should see the results. If Kerberos is enabled, you will need to run test_spark_hql.kerberos.sh instead.
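
If you prefer to explore the HiveContext API interactively rather than through the packaged test scripts, a minimal sketch using spark-shell in yarn-client mode (reusing the jar variables from the spark-sql example above) looks like this:

cd $SPARK_HOME/bin/
echo 'val hc = new org.apache.spark.sql.hive.HiveContext(sc); hc.sql("SHOW TABLES").collect().foreach(println)' | \
  ./spark-shell --master yarn-client --executor-memory 1G --num-executors 2 $spark_opts_extra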

 

Spark ThriftServer2 (Spark TS2)

You will only need to run this if you are trying to connect a BI application (e.g. Tableau) to use the SparkSQL engine for your SQL queries. Spark ThriftServer2 is like another Spark job; it inherits a Spark job's configuration.

This feature is still experimental and not production ready.

For example, a complicated command with a lot of tuning (which is required) may look as shown below.

Be aware that the following example takes up 32GB of cluster memory across 8 containers, plus 1GB of memory on your Workbench.

The $(hostname) parameter is required for the SSH tunneling described afterward, since the server binds only to that particular network interface on your Workbench.

sudo su - spark
cd /opt/spark/sbin/
./start-thriftserver.sh --hiveconf hive.server2.thrift.port=20000 --hiveconf hive.server2.thrift.bind.host=$(hostname) \
  --master yarn-client --queue production --executor-memory 4G --num-executors 8 --executor-cores 4 --driver-memory 1G \
  --conf spark.kryoserializer.buffer.mb=128 --conf spark.locality.wait=10000 --conf spark.shuffle.manager=sort \
  --conf spark.shuffle.consolidateFiles=true --conf spark.rdd.compress=true --conf spark.storage.memoryFraction=0.6 \
  --conf spark.sql.inMemoryColumnarStorage.compressed=true --conf spark.sql.inMemoryColumnarStorage.batchSize=1012000


The following shows how to tunnel to a Spark ThriftServer2 listening on the $(hostname) of your Workbench.

The <HOSTNAME> is the value returned by the hostname command on the system where you started the Spark ThriftServer2.

ssh -L 20000:<HOSTNAME>:20000 workbench

Now you can point your BI application at Spark ThriftServer2 using localhost:20000. If you are using an ODBC driver, the port number is now 20000 instead of the default HiveServer2 port of 10000.
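
If a beeline client is available on your local machine, you can verify the tunnel before pointing a BI tool at it (this assumes the SSH tunnel above is open and authentication is set to NONE, as configured earlier):

beeline -u jdbc:hive2://localhost:20000 -n $USER -e "SHOW DATABASES;"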

Source: http://documentation.altiscale.com/spark-sql-how-to