From c99ab54ff07bbbc30e8c096727d813912d2ac0f0 Mon Sep 17 00:00:00 2001
From: Maxim Martynov
Date: Sat, 16 Dec 2023 22:58:53 +0300
Subject: [PATCH] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 84e98447..39cdb20c 100644
--- a/README.md
+++ b/README.md
@@ -47,12 +47,12 @@ When reading files the API accepts several options:
 * `inferSchema`: if `true`, attempts to infer an appropriate type for each resulting DataFrame column, like a boolean, numeric or date type. If `false`, all resulting columns are of string type. Default is `true`.
 * `columnNameOfCorruptRecord`: The name of new field where malformed strings are stored. Default is `_corrupt_record`.
-  Note: you should explicitly add `_corrupt_record` field to dataframe schema, like this:
+  Note: if you pass a `schema` explicitly, you should add a `_corrupt_record` field to it, like this:
   ```python
   schema = StructType([StructField("my_field", TimestampType()), StructField("_corrupt_record", StringType())])
   spark.read.format("xml").options(rowTag='item').schema(schema).load("file.xml")
   ```
-  Otherwise the corrupt record will lead to creating row with all `null` fields, and you cannot access the original xml string.
+  Otherwise, parsing a corrupt record produces a row with all `null` fields, and you cannot access the original XML string.
 * `attributePrefix`: The prefix for attributes so that we can differentiate attributes and elements. This will be the prefix for field names. Default is `_`. Can be empty, but only for reading XML.
 * `valueTag`: The tag used for the value when there are attributes in the element having no child. Default is `_VALUE`.
 * `charset`: Defaults to 'UTF-8' but can be set to other valid charset names