docs/sql-data-sources-csv.md
Spark SQL provides spark.read().csv("file_name") to read a file or directory of files in CSV format into Spark DataFrame, and dataframe.write().csv("path") to write to a CSV file. Function option() can be used to customize the behavior of reading or writing, such as controlling behavior of the header, delimiter character, character set, and so on.
Data source options of CSV can be set via:
.option/.options methods of
DataFrameReaderDataFrameWriterDataStreamReaderDataStreamWriterfrom_csvto_csvschema_of_csvOPTIONS clause at CREATE TABLE USING DATA_SOURCE<ul>
<li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, an user can set a string type field named <code>columnNameOfCorruptRecord</code> in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. A record with less/more tokens than schema is not a corrupted record to CSV. When it meets a record having fewer tokens than the length of the schema, sets <code>null</code> to extra fields. When the record has more tokens than the length of the schema, it drops extra tokens.</li>
<li><code>DROPMALFORMED</code>: ignores the whole corrupted records. This mode is unsupported in the CSV built-in functions.</li>
<li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
</ul>
</td>
<td>read</td>
<ul>
<li><code>STOP_AT_CLOSING_QUOTE</code>: If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found.</li>
<li><code>BACK_TO_DELIMITER</code>: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters of the current parsed value until the delimiter is found. If no delimiter is found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found.</li>
<li><code>STOP_AT_DELIMITER</code>: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until the delimiter or a line ending is found in the input.</li>
<li><code>SKIP_VALUE</code>: If unescaped quotes are found in the input, the content parsed for the given value will be skipped and the value set in nullValue will be produced instead.</li>
<li><code>RAISE_ERROR</code>: If unescaped quotes are found in the input, a TextParsingException will be thrown.</li>
</ul>
</td>
<td>read</td>
<ul>
<li>Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.</li>
<li>Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
</ul>
Other short names like 'CST' are not recommended to use because they can be ambiguous.
</td>
<td>read/write</td>