Formación Hadoop Forum

Spark exercise

 
Spark exercise
by MIGUEL OROPEZA - Friday, 23 February 2018, 19:53
 

Hello.

When I try to run a Spark application written in Python:

import sys
from pyspark import SparkContext
from pyspark import SparkConf

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print >> sys.stderr, "Usage: CountJPGs <file>"
        exit(-1)

    sc = SparkContext()
    # Challenge: configure app name and UI port programmatically
    # sconf = SparkConf().setAppName("My Spark App")
    # sc = SparkContext(conf=sconf)
    logfile = sys.argv[1]
    # NOTE: logfile is assigned but never used; the path below is hardcoded
    count = sc.textFile("file:/home/cloudera/formacionhadood/weblogs/2014-02-02.log").filter(lambda line: '.jpg' in line).count()
    print "Number of JPG requests: ", count
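As a quick sanity check of the filter itself, the same lambda can be exercised with plain Python, no cluster needed. The sample log lines below are invented for illustration; the real input is the weblogs file:

```python
# Stand-in for the RDD pipeline:
#   textFile(...).filter(lambda line: '.jpg' in line).count()
# The sample lines are made up; only the '.jpg' substring test matters here.
sample_lines = [
    '233.19.62.103 - 16261 [15/Sep/2013:23:55:57] "GET /theme.css HTTP/1.0" 200 5531',
    '233.19.62.103 - 16261 [15/Sep/2013:23:55:58] "GET /logo.jpg HTTP/1.0" 200 9845',
    '135.32.87.1 - 12006 [15/Sep/2013:23:56:03] "GET /banner.jpg HTTP/1.0" 200 8051',
]

# filter(...).count() on an RDD behaves like this list comprehension + len()
jpg_count = len([line for line in sample_lines if '.jpg' in line])
print("Number of JPG requests:", jpg_count)
```

Two of the three sample lines contain `.jpg`, so this prints 2.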


I get this error:

[cloudera@quickstart mispracticas]$ spark-submit mipru1.py /home/cloudera/mispracticas
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.12.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/02/23 10:49:53 INFO spark.SparkContext: Running Spark version 1.6.0
18/02/23 10:49:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/02/23 10:49:57 INFO spark.SecurityManager: Changing view acls to: cloudera
18/02/23 10:49:57 INFO spark.SecurityManager: Changing modify acls to: cloudera
18/02/23 10:49:57 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); users with modify permissions: Set(cloudera)
18/02/23 10:49:59 INFO util.Utils: Successfully started service 'sparkDriver' on port 49498.
18/02/23 10:50:01 INFO slf4j.Slf4jLogger: Slf4jLogger started
18/02/23 10:50:01 INFO Remoting: Starting remoting
18/02/23 10:50:02 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.174.130:54972]
18/02/23 10:50:02 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriverActorSystem@192.168.174.130:54972]
18/02/23 10:50:02 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 54972.
18/02/23 10:50:03 INFO spark.SparkEnv: Registering MapOutputTracker
18/02/23 10:50:03 INFO spark.SparkEnv: Registering BlockManagerMaster
18/02/23 10:50:03 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-17196377-85f9-4a11-a937-64d45a823c69
18/02/23 10:50:03 INFO storage.MemoryStore: MemoryStore started with capacity 534.5 MB
18/02/23 10:50:03 INFO spark.SparkEnv: Registering OutputCommitCoordinator
18/02/23 10:50:04 INFO server.Server: jetty-8.y.z-SNAPSHOT
18/02/23 10:50:04 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
18/02/23 10:50:04 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
18/02/23 10:50:05 INFO ui.SparkUI: Started SparkUI at http://192.168.174.130:4040
18/02/23 10:50:08 INFO util.Utils: Copying /home/cloudera/mispracticas/mipru1.py to /tmp/spark-2c84e57f-89f7-46b5-8d4c-4328022a600f/userFiles-bf05c582-b3e4-4cf1-ac4d-f5325163585f/mipru1.py
18/02/23 10:50:08 INFO spark.SparkContext: Added file file:/home/cloudera/mispracticas/mipru1.py at file:/home/cloudera/mispracticas/mipru1.py with timestamp 1519411808226
18/02/23 10:50:08 INFO executor.Executor: Starting executor ID driver on host localhost
18/02/23 10:50:09 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40743.
18/02/23 10:50:09 INFO netty.NettyBlockTransferService: Server created on 40743
18/02/23 10:50:09 INFO storage.BlockManagerMaster: Trying to register BlockManager
18/02/23 10:50:09 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:40743 with 534.5 MB RAM, BlockManagerId(driver, localhost, 40743)
18/02/23 10:50:09 INFO storage.BlockManagerMaster: Registered BlockManager
18/02/23 10:50:13 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 203.0 KB, free 534.3 MB)
18/02/23 10:50:13 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.1 KB, free 534.3 MB)
18/02/23 10:50:13 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:40743 (size: 24.1 KB, free: 534.5 MB)
18/02/23 10:50:13 INFO spark.SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
18/02/23 10:50:16 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/02/23 10:50:16 INFO mapred.FileInputFormat: Total input paths to process : 1
18/02/23 10:50:17 INFO spark.SparkContext: Starting job: count at /home/cloudera/mispracticas/mipru1.py:15
18/02/23 10:50:17 INFO scheduler.DAGScheduler: Got job 0 (count at /home/cloudera/mispracticas/mipru1.py:15) with 1 output partitions
18/02/23 10:50:17 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (count at /home/cloudera/mispracticas/mipru1.py:15)
18/02/23 10:50:17 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/02/23 10:50:17 INFO scheduler.DAGScheduler: Missing parents: List()
18/02/23 10:50:17 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] at count at /home/cloudera/mispracticas/mipru1.py:15), which has no missing parents
18/02/23 10:50:17 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.1 KB, free 534.3 MB)
18/02/23 10:50:17 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.8 KB, free 534.3 MB)
18/02/23 10:50:17 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:40743 (size: 3.8 KB, free: 534.5 MB)
18/02/23 10:50:17 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1004
18/02/23 10:50:17 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (PythonRDD[2] at count at /home/cloudera/mispracticas/mipru1.py:15) (first 15 tasks are for partitions Vector(0))
18/02/23 10:50:17 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
18/02/23 10:50:18 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 2212 bytes)
18/02/23 10:50:18 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
18/02/23 10:50:18 INFO executor.Executor: Fetching file:/home/cloudera/mispracticas/mipru1.py with timestamp 1519411808226
18/02/23 10:50:18 INFO util.Utils: /home/cloudera/mispracticas/mipru1.py has been previously copied to /tmp/spark-2c84e57f-89f7-46b5-8d4c-4328022a600f/userFiles-bf05c582-b3e4-4cf1-ac4d-f5325163585f/mipru1.py
18/02/23 10:50:18 INFO rdd.HadoopRDD: Input split: file:/home/cloudera/formacionhadood/weblogs/2014-02-02.log:0+1099940
18/02/23 10:50:18 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
18/02/23 10:50:18 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
18/02/23 10:50:18 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
18/02/23 10:50:18 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
18/02/23 10:50:18 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
18/02/23 10:50:20 INFO python.PythonRunner: Times: total = 1645, boot = 744, init = 99, finish = 802
18/02/23 10:50:20 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2125 bytes result sent to driver
18/02/23 10:50:20 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2478 ms on localhost (executor driver) (1/1)
18/02/23 10:50:20 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/23 10:50:20 INFO scheduler.DAGScheduler: ResultStage 0 (count at /home/cloudera/mispracticas/mipru1.py:15) finished in 2.931 s
18/02/23 10:50:20 INFO scheduler.DAGScheduler: Job 0 finished: count at /home/cloudera/mispracticas/mipru1.py:15, took 3.431908 s
Number of JPG requests:  449
18/02/23 10:50:20 INFO spark.SparkContext: Invoking stop() from shutdown hook
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
18/02/23 10:50:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
18/02/23 10:50:20 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.174.130:4040
18/02/23 10:50:20 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/02/23 10:50:20 INFO storage.MemoryStore: MemoryStore cleared
18/02/23 10:50:20 INFO storage.BlockManager: BlockManager stopped
18/02/23 10:50:21 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
18/02/23 10:50:21 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/02/23 10:50:21 INFO spark.SparkContext: Successfully stopped SparkContext
18/02/23 10:50:21 INFO util.ShutdownHookManager: Shutdown hook called
18/02/23 10:50:21 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-2c84e57f-89f7-46b5-8d4c-4328022a600f/pyspark-3ebb7eb3-a555-4b48-84b5-287d6e867111
18/02/23 10:50:21 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-2c84e57f-89f7-46b5-8d4c-4328022a600f
18/02/23 10:50:21 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
[cloudera@quickstart mispracticas]$

Regards

 

Re: Spark exercise
by Admin Formación Hadoop - Saturday, 24 February 2018, 08:53
 

Good morning Miguel,

There is no ERROR there; those are just the execution logs. If you look closely, the last lines show the "print" you put in the application:

Number of JPG requests:  449
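As an aside, if all that INFO output is distracting, you can raise Spark's log level so that spark-submit mostly shows your own prints. One common way (paths may differ on your QuickStart VM) is to copy `conf/log4j.properties.template` to `conf/log4j.properties` under your Spark installation and change the root level:

```properties
# $SPARK_HOME/conf/log4j.properties (copied from log4j.properties.template)
# Raise the root logger from INFO to WARN to silence routine startup/shutdown messages
log4j.rootCategory=WARN, console
```

Alternatively, you can set it per application after creating the context with `sc.setLogLevel("WARN")`.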

 

Regards,

Re: Spark exercise
by MIGUEL OROPEZA - Monday, 26 February 2018, 15:42
 

Thanks, I hadn't noticed it.

 

Regards