I am trying to write a DataFrame to a SQL table whose data contains some extended characters, such as the em-dash and the pound symbol. The data is not being written correctly: the table shows junk characters in place of the pound symbol or em-dash.
Is there an alternative way to write the data to the SQL table without corrupting it? I have already added "useUnicode=true&characterEncoding=utf8&characterSetResults=utf8" to my JDBC URL and used "DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci" in my CREATE TABLE schema, but the issue still persists!
JDBC URL: jdbc:mysql://localhost:3306/MySchema?zeroDateTimeBehavior=convertToNull&allowPublicKeyRetrieval=true&useSSL=false&serverTimezone=UTC&useUnicode=true&characterEncoding=utf8&characterSetResults=utf8
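A variant of this URL, assuming MySQL Connector/J 8.x (which maps characterEncoding=UTF-8 to utf8mb4 and supports a connectionCollation property to pin the session collation), would look like the sketch below; I have not verified that it fixes the issue:

```
jdbc:mysql://localhost:3306/MySchema?zeroDateTimeBehavior=convertToNull&allowPublicKeyRetrieval=true&useSSL=false&serverTimezone=UTC&characterEncoding=UTF-8&connectionCollation=utf8mb4_unicode_520_ci
```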
Sample Code:
// Attempted workaround: re-encode via UTF-8 bytes.
// Note: new String(bytes) with no charset decodes using the JVM default charset.
val handleSpecialChar = udf { (make: String) => new String(make.getBytes("UTF-8")) }

val test_df = Seq(
  ("Row 1", "Hello–World"),
  ("Row 2", "Hello—World"),
  ("Row 3", "Hello˜World"),
  ("Row 4", "Hello•World"),
  ("Row 5", "Hello™World")
)

var test = sparkSession.createDataFrame(test_df).toDF("id", "test_string")
test.show(false)

test = test.withColumn("test_string", handleSpecialChar(test("test_string")))
test.show(false)

test.coalesce(1).write.mode(SaveMode.Overwrite).option("truncate", "true")
  .jdbc(dbUrlMis, "MySchema.testdata", misConnectionProperties)
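A side note on the UDF above: new String(bytes) without an explicit charset decodes with the JVM's default charset, so the round-trip is only lossless when that default is UTF-8. A minimal plain-Scala sketch (no Spark, names are mine) illustrating the difference:

```scala
import java.nio.charset.StandardCharsets

object CharsetRoundTrip extends App {
  val s = "Hello—World" // em-dash, U+2014

  // Encoding to UTF-8 and decoding with the same charset is lossless.
  val safe = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)
  assert(safe == s)

  // Decoding those same UTF-8 bytes with a single-byte charset (what happens
  // implicitly when the JVM default is e.g. ISO-8859-1) mangles the em-dash.
  val mangled = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1)
  assert(mangled != s)

  println(s"safe=$safe, mangled=$mangled")
}
```

So even if the JDBC side were configured correctly, this UDF could itself introduce corruption on a machine whose default file.encoding is not UTF-8.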
CREATE statement for the testdata table:
CREATE TABLE mySchema.`testdata` (
`id` varchar(200) DEFAULT NULL,
`test_string` varchar(200) DEFAULT NULL,
KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci;
Sample database table output after writing: (https://i.sstatic.net/VbnwEOth.jpg)
Sample IntelliJ console output after executing test.show(): (https://i.sstatic.net/zEyRyg5n.jpg)