Problem Description
Unable to save dataframe in Redshift
I am reading a large dataset from an HDFS location and saving the dataframe to Redshift:
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table_copy")
  .option("tempdir", "s3n://path/for/temp/data")
  .mode("error")
  .save()
After a while, the following error occurs:
s3.amazonaws.com:443 failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:223)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1043)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.copyObjectImpl(RestStorageService.java:2029)
at org.jets3t.service.StorageService.copyObject(StorageService.java:871)
at org.jets3t.service.StorageService.copyObject(StorageService.java:916)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:323)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:707)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:370)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:
I found the same issue reported on GitHub.
Am I doing something wrong? Please help.
Reference Solution
Method 1:
I had the same issue; in my case I was using AWS EMR too.
The Databricks spark-redshift library uses Amazon S3 to efficiently transfer data in and out of Redshift. The library first writes the data to Amazon S3 as Avro files, and these Avro files are then loaded into Redshift; on EMR the S3 traffic goes through EMRFS.
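To make the two-phase behavior concrete, here is a rough sketch of what the library effectively does for the write in the question. The staging path, the dataset path, and the COPY statement are illustrative assumptions, not the library's literal output (it generates its own temp directories and COPY options); the snippet is spark-shell style Scala:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("redshift-write-sketch").getOrCreate()

// Hypothetical source; the question reads a large dataset from HDFS.
val df = spark.read.parquet("hdfs:///path/to/large/dataset")

// Phase 1: spark-redshift stages the dataframe as Avro files under tempdir.
df.write
  .format("com.databricks.spark.avro")
  .save("s3n://path/for/temp/data/run-1/")

// Phase 2: it then issues a Redshift COPY over the JDBC connection, roughly:
//   COPY my_table_copy FROM 's3://path/for/temp/data/run-1/' FORMAT AS AVRO 'auto';

Note that the stack trace above fails inside NativeS3FileSystem.rename during FileOutputCommitter.commitJob, i.e. while phase 1's staged files are being moved into place in S3, before the COPY ever runs.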
You have to configure your EMRFS settings and it will work.
The EMR File System (EMRFS) and the Hadoop Distributed File System (HDFS) are both installed on your EMR cluster. EMRFS is an implementation of HDFS which allows EMR clusters to store data on Amazon S3.
EMRFS will try to verify list consistency for objects tracked in its metadata for a specific number of retries (emrfs-retry-logic). The default is 5. If the number of retries is exceeded, the originating job returns a failure. To overcome this issue you can override your default EMRFS configuration with the following steps:
Step 1: Log in to your EMR master instance
Step 2: Add the following properties to /usr/share/aws/emr/emrfs/conf/emrfs-site.xml:
sudo vi /usr/share/aws/emr/emrfs/conf/emrfs-site.xml

<property>
  <name>fs.s3.consistent.throwExceptionOnInconsistency</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3.consistent.retryPolicyType</name>
  <value>fixed</value>
</property>
<property>
  <name>fs.s3.consistent.retryPeriodSeconds</name>
  <value>10</value>
</property>
<property>
  <name>fs.s3.consistent</name>
  <value>false</value>
</property>
Then restart your EMR cluster.
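If you create clusters programmatically, the same properties can instead be supplied at cluster-creation time through EMR's emrfs-site configuration classification, avoiding the manual edit on the master node. A minimal sketch of the configurations JSON (the classification name is real; the values simply mirror the XML above):

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.consistent": "false",
      "fs.s3.consistent.retryPolicyType": "fixed",
      "fs.s3.consistent.retryPeriodSeconds": "10",
      "fs.s3.consistent.throwExceptionOnInconsistency": "false"
    }
  }
]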
Also configure your hadoopConfiguration:

val hadoopConf = SparkDriver.getContext.hadoopConfiguration
// Use the native S3 filesystem implementation for s3:// URIs
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
// Raise the retry limit for S3A (only affects s3a:// paths)
hadoopConf.set("fs.s3a.attempts.maximum", "30")
// Credentials for the s3n:// filesystem used by tempdir
hadoopConf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
hadoopConf.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
(by Neelam Bhargava, Gabber)