I am trying to write a DataFrame to a MySQL table whose string data contains some non-ASCII characters such as the em-dash and the pound sign (£). These characters are not written correctly: the table shows junk characters in place of the pound sign or em-dash.

Is there an alternative way to write the data to the SQL table without corrupting it? I have already added "useUnicode=true&characterEncoding=utf8&characterSetResults=utf8" to my JDBC URL and specified "DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci" in my CREATE TABLE statement, but the issue still persists!

JDBC URL: jdbc:mysql://localhost:3306/MySchema?zeroDateTimeBehavior=convertToNull&allowPublicKeyRetrieval=true&useSSL=false&serverTimezone=UTC&useUnicode=true&characterEncoding=utf8&characterSetResults=utf8
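For completeness, this is roughly how `dbUrlMis` and `misConnectionProperties` (referenced in the code below) are set up; the credentials here are placeholders, and the driver class assumes MySQL Connector/J 8:

    import java.util.Properties

    val dbUrlMis = "jdbc:mysql://localhost:3306/MySchema?zeroDateTimeBehavior=convertToNull&allowPublicKeyRetrieval=true&useSSL=false&serverTimezone=UTC&useUnicode=true&characterEncoding=utf8&characterSetResults=utf8"

    val misConnectionProperties = new Properties()
    misConnectionProperties.put("user", "<db_user>")          // placeholder
    misConnectionProperties.put("password", "<db_password>")  // placeholder
    misConnectionProperties.put("driver", "com.mysql.cj.jdbc.Driver") // Connector/J 8 driver class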

Sample Code:

    import java.nio.charset.StandardCharsets

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.udf

    // Attempted workaround: round-trip each string through UTF-8 bytes,
    // specifying the charset on both the encode and decode sides.
    val handleSpecialChar = udf { (make: String) =>
      new String(make.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)
    }

    val test_df = Seq(
      ("Row 1", "Hello–World"),
      ("Row 2", "Hello—World"),
      ("Row 3", "Hello˜World"),
      ("Row 4", "Hello•World"),
      ("Row 5", "Hello™World"),
    )

    var test = sparkSession.createDataFrame(test_df).toDF("id", "test_string")
    test.show(false)

    test = test.withColumn("test_string", handleSpecialChar(test("test_string")))
    test.show(false)


    test.coalesce(1).write.mode(SaveMode.Overwrite).option("truncate", "true")
      .jdbc(dbUrlMis, "MySchema.testdata", misConnectionProperties)
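To check whether the data is corrupted on write or only when displayed by the MySQL client, I also read the table back through the same JDBC connection (same URL and properties as above):

    // Read the rows back via Spark JDBC and compare with the
    // console output of test.show() above.
    val readBack = sparkSession.read.jdbc(dbUrlMis, "MySchema.testdata", misConnectionProperties)
    readBack.show(false)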

CREATE statement of the testdata table:

    CREATE TABLE MySchema.`testdata` (
      `id` varchar(200) DEFAULT NULL,
      `test_string` varchar(200) DEFAULT NULL,
      KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci;
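I also checked which character sets the session actually negotiates, to rule out a server-side override (a quick sketch over plain JDBC, with the same placeholder credentials as above):

    import java.sql.DriverManager

    // Print the character_set_* variables seen by this connection.
    val conn = DriverManager.getConnection(dbUrlMis, "<db_user>", "<db_password>")
    val rs = conn.createStatement().executeQuery("SHOW VARIABLES LIKE 'character_set%'")
    while (rs.next()) {
      println(s"${rs.getString(1)} = ${rs.getString(2)}")
    }
    conn.close()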

Sample database table output after writing: https://i.sstatic.net/VbnwEOth.jpg

Sample IntelliJ console output after executing test.show(): https://i.sstatic.net/zEyRyg5n.jpg
