Library used by FnF to create parquet files is different from the one Spark uses #305
@matteofigus: Need your expertise here, please.
Hello, just to understand your question better: is there any underlying context or issue that led you to look at this, e.g. unexpected changes to the data stored in the parquet file?
@ctd: Thank you for your quick response. Yes, we observed a small issue (which does not impact us badly). Spark 3 uses a different calendar, as per the JIRA below; I believe all dates/timestamps before 1900 are impacted: https://issues.apache.org/jira/browse/SPARK-31404. We used the workaround below to read the data through Spark: https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-3-upgrade?view=sql-server-ver15 (please refer to the section "SparkUpgradeException due to calendar mode change"). Hence, we wanted your expert opinion on whether other issues like this might pop up, given that parquet-mr and parquet-arrow are two different libraries.
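For reference, the workaround in the linked Microsoft article comes down to setting Spark's datetime rebase mode. Below is a minimal PySpark sketch, assuming a Spark 3.x cluster where the rebase settings introduced around SPARK-31404 are available; the S3 path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-legacy-timestamps").getOrCreate()

# Rebase dates/timestamps written with the legacy hybrid Julian/Gregorian
# calendar instead of raising SparkUpgradeException for values before 1900.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
# INT96 timestamps have their own setting on newer Spark 3 releases.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")

df = spark.read.parquet("s3://example-bucket/example-dataset/")  # hypothetical path
df.select("datetime").show(5, truncate=False)
```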
Hi @vivek-biradar @VAIBHAVTARANGE, thanks for opening an issue. I am not very familiar with the scenario you mentioned, but I know that manipulating dates and times is indeed risky due to compatibility issues; in fact we mention it in the production readiness docs: https://github.com/awslabs/amazon-s3-find-and-forget/blob/master/docs/PRODUCTION_READINESS_GUIDELINES.md#4-run-your-test-queries Will there be any other issues? To be honest, I don't know, but I think you are on the right path to find out. For each dataset my recommendation is to keep a sample in a test account, perform a test deletion, and validate the schema of the output to ensure all systems you use for reading are backward-compatible with the newly created object. After you perform the necessary testing, you can onboard the dataset in production.
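That schema validation step can be scripted. Here is a minimal sketch using pyarrow, with hypothetical file names, comparing a sample object before and after a deletion job; it is only an illustration of the check described above, not part of FnF itself.

```python
import pyarrow.parquet as pq

# Hypothetical local copies of the same object before and after a FnF deletion job.
before = pq.ParquetFile("sample_before_fnf.parquet")
after = pq.ParquetFile("sample_after_fnf.parquet")

# Compare the logical (Arrow-level) schemas of the two objects.
if before.schema_arrow.equals(after.schema_arrow):
    print("Logical schema unchanged")
else:
    print("Logical schema changed:")
    print(before.schema_arrow)
    print(after.schema_arrow)

# The physical Parquet column types (e.g. INT96 vs INT64) can differ even when
# the logical schema matches, so inspect those as well.
print(before.schema)
print(after.schema)
```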
@matteofigus: Thank you for the reply. We did test this in our test environment and are running in production based on that. Our question was more forward-looking: will pyarrow and parquet-mr stay in sync on the parquet format as they are now? I know it's a hard question, but it would help if you could get an expert recommendation within the AWS team (probably the EMR folks). Again, thank you for such a quick response.
The library used by FnF is parquet-cpp-arrow version 7.0.0, and
the library used by Spark is parquet-mr version 1.10.1.
The schema for the timestamp column changes as shown below.
Pre FnF:-
############ Column(datetime) ############
name: datetime
path: datetime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE
Post FnF:-
############ Column(datetime) ############
name: datetime
path: datetime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
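For reproducibility, column metadata like the dumps above can be printed with pyarrow itself; the following is only a sketch (the file path is hypothetical, and the exact formatting depends on the tool and version used to produce the dumps).

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("datafile.parquet")  # hypothetical path

# Print physical_type / logical_type / converted_type for every column,
# similar to the per-column dumps shown above.
for i in range(pf.metadata.num_columns):
    print(pf.schema.column(i))
```

As a side note on the INT96 vs INT64 difference itself: pyarrow's writer can still emit the deprecated INT96 representation (e.g. `pyarrow.parquet.write_table(..., use_deprecated_int96_timestamps=True)`); whether and how FnF exposes that is a separate question and is not implied here.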
Do you foresee any issues in the future as Spark moves to newer versions?