Unable to save dataframe in redshift


Problem Description


I am reading a large dataset from an HDFS location and saving my dataframe to Redshift:

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table_copy")
  .option("tempdir", "s3n://path/for/temp/data")
  .mode("error")
  .save()
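
For completeness, here is a minimal sketch of the full read-then-write flow, assuming a Parquet source and a placeholder HDFS path (the source format and path are not specified in the original question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hdfs-to-redshift")
  .getOrCreate()

// Read the large dataset from HDFS (format and path are assumptions)
val df = spark.read.parquet("hdfs:///path/to/large/dataset")

// spark-redshift stages the data in the S3 tempdir, then loads it into Redshift
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table_copy")
  .option("tempdir", "s3n://path/for/temp/data")
  .mode("error")
  .save()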

After some time, the job fails with the following error:

s3.amazonaws.com:443 failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:223)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1043)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.copyObjectImpl(RestStorageService.java:2029)
at org.jets3t.service.StorageService.copyObject(StorageService.java:871)
at org.jets3t.service.StorageService.copyObject(StorageService.java:916)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:323)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:707)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:370)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:

I found the same issue reported on GitHub:

s3.amazonaws.com:443 failed to respond

Am I doing something wrong? Please help.


Reference Solution

Method 1:

I had the same issue; in my case I was using AWS EMR too.

The Databricks spark-redshift library uses Amazon S3 to transfer data efficiently in and out of Redshift. The library first writes the data to Amazon S3 as Avro files, and those files are then loaded into Redshift using EMRFS.

You have to configure your EMRFS settings, and then it will work.

The EMR File System (EMRFS) and the Hadoop Distributed File System (HDFS) are both installed on your EMR cluster. EMRFS is an implementation of HDFS that allows EMR clusters to store data on Amazon S3.

EMRFS tries to verify list consistency for objects tracked in its metadata for a specific number of retries (emrfs-retry-logic). The default is 5. When the number of retries is exceeded, the originating job fails. To overcome this issue you can override the default EMRFS configuration with the following steps:

Step 1: Log in to your EMR master instance.

Step 2: Add the following properties to /usr/share/aws/emr/emrfs/conf/emrfs-site.xml:

sudo vi /usr/share/aws/emr/emrfs/conf/emrfs-site.xml

<property>
    <name>fs.s3.consistent.throwExceptionOnInconsistency</name>
    <value>false</value>
</property>

<property>
    <name>fs.s3.consistent.retryPolicyType</name>
    <value>fixed</value>
</property>
<property>
    <name>fs.s3.consistent.retryPeriodSeconds</name>
    <value>10</value>
</property>
<property>
    <name>fs.s3.consistent</name>
    <value>false</value>
</property>

Then restart your EMR cluster.
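
As an aside (this is an addition, not part of the original answer): on EMR release 4.x and later you can apply the same overrides cluster-wide at launch time through the emrfs-site configuration classification, instead of editing the file on the master node. A sketch of the JSON you would pass in the cluster's Configurations field:

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.consistent": "false",
      "fs.s3.consistent.throwExceptionOnInconsistency": "false",
      "fs.s3.consistent.retryPolicyType": "fixed",
      "fs.s3.consistent.retryPeriodSeconds": "10"
    }
  }
]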

Also configure your Hadoop configuration, for example hadoopConf.set("fs.s3a.attempts.maximum", "30"):

val hadoopConf = SparkDriver.getContext.hadoopConfiguration
// Use the native S3 filesystem implementation for s3:// paths
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
// Increase the maximum number of retry attempts for s3a requests
hadoopConf.set("fs.s3a.attempts.maximum", "30")
// Credentials for the s3n filesystem used by the tempdir
hadoopConf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
hadoopConf.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)

(by Neelam BhargavaGabber)

References

  1. Unable to save dataframe in redshift (CC BY-SA 2.5/3.0/4.0)

#amazon-redshift #apache-spark #amazon-s3 #scala





