# assume I have csv text data encoded in ISO-8859-1 encoding
# I load the StringEncodings package, which provides encoding conversion functionality
using CSV, StringEncodings
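# a minimal sketch, assuming a hypothetical ISO-8859-1 encoded file "data.csv":
# read the raw bytes, decode them to UTF-8, then parse as usual
bytes = read("data.csv")
data = decode(bytes, "ISO-8859-1")
file = CSV.File(IOBuffer(data))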
# note this isn't required, but can be convenient in certain cases
file = CSV.File(IOBuffer(data); normalizenames=true)
# we can access the first column like
file._1
# another example where we may want to normalize is column names with spaces in them
# In this data, we have a few "quoted" fields, which means the field's value starts and ends with `quotechar` (or
# `openquotechar` and `closequotechar`, respectively). Quoted fields allow the field to contain characters that would otherwise
# be significant to parsing, such as delimiters or newline characters. When quoted, parsing will ignore these otherwise
# significant characters until the closing quote character is found. For quoted fields that need to also include the quote
# character itself, an escape character is provided to tell parsing to ignore the next character when looking for a close quote
# character. In the syntax examples, the keyword arguments are passed explicitly, but these also happen to be the default
# values, so just doing `CSV.File(IOBuffer(data))` would result in successful parsing.
8CD2E,GC
"""
file = CSV.File(IOBuffer(data); pool=(0.5, 2))
To start out, let's discuss the high-level functionality provided by the package, which hopefully will help direct you to more specific documentation for your use-case:
CSV.File: the most commonly used function for ingesting delimited data; will read an entire data input or vector of data inputs, detecting number of columns and rows, along with the type of data for each column. Returns a CSV.File object, which is like a lightweight table/DataFrame. Assuming file is a variable of a CSV.File object, individual columns can be accessed like file.col1, file[:col1], or file["col"]. You can see parsed column names via file.names. A CSV.File can also be iterated, where a CSV.Row is produced on each iteration, which allows access to each value in the row via row.col1, row[:col1], or row[1]. You can also index a CSV.File directly, like file[1] to return the entire CSV.Row at the provided index/row number. Multiple threads will be used while parsing the input data if the input is large enough, and full return column buffers to hold the parsed data will be allocated. CSV.File satisfies the Tables.jl "source" interface, and so can be passed to valid sink functions like DataFrame, SQLite.load!, Arrow.write, etc. Supports a number of keyword arguments to control parsing, column type, and other file metadata options.
CSV.read: a convenience function identical to CSV.File, but used when a CSV.File will be passed directly to a sink function, like a DataFrame. In some cases, sinks may make copies of incoming data for their own safety; by calling CSV.read(file, DataFrame), no copies of the parsed CSV.File will be made, and the DataFrame will take direct ownership of the CSV.File's columns, which is more efficient than doing CSV.File(file) |> DataFrame which will result in an extra copy of each column being made. Keyword arguments are identical to CSV.File. Any valid Tables.jl sink function/table type can be passed as the 2nd argument. Like CSV.File, a vector of data inputs can be passed as the 1st argument, which will result in a single "long" table of all the inputs vertically concatenated. Each input must have identical schemas (column names and types).
CSV.Rows: an alternative approach for consuming delimited data, where the input is only consumed one row at a time, which allows "streaming" the data with a lower memory footprint than CSV.File. Supports many of the same options as CSV.File, except column type handling is a little different. By default, every column type will be essentially Union{Missing, String}, i.e. no automatic type detection is done, but column types can be provided manually. Multithreading is not used while parsing. After constructing a CSV.Rows object, rows can be "streamed" by iterating, where each iteration produces a CSV.Row2 object, which operates similar to CSV.File's CSV.Row type where individual row values can be accessed via row.col1, row[:col1], or row[1]. If each row is processed individually, additional memory can be saved by passing reusebuffer=true, which means a single buffer will be allocated to hold the values of only the currently iterated row. CSV.Rows also supports the Tables.jl interface and can also be passed to valid sink functions.
CSV.Chunks: similar to CSV.File, but allows passing a ntasks::Integer keyword argument which will cause the input file to be "chunked" up into ntasks number of chunks. After constructing a CSV.Chunks object, each iteration of the object will return a CSV.File of the next parsed chunk. Useful for processing extremely large files in "chunks". Because each iterated element is a valid Tables.jl "source", CSV.Chunks satisfies the Tables.partitions interface, so sinks that can process input partitions can operate by passing CSV.Chunks as the "source".
CSV.write: A valid Tables.jl "sink" function for writing any valid input table out in a delimited text format. Supports many options for controlling the output like delimiter, quote characters, etc. Writes data to an internal buffer, which is flushed out when full; the buffer size is configurable. Also supports writing out partitioned inputs as separate output files, one file per input partition. To write out a DataFrame, for example, it's simply CSV.write("data.csv", df), or to write out a matrix, it's using Tables; CSV.write("data.csv", Tables.table(mat))
CSV.RowWriter: An alternative way to produce csv output; takes any valid Tables.jl input, and on each iteration, produces a single csv-formatted string from the input table's row.
Want to iterate an input table and produce a single csv string per row? CSV.RowWriter.
For the rest of the manual, we're going to have two big sections, Reading and Writing where we'll walk through the various options to CSV.File/CSV.read/CSV.Rows/CSV.Chunks and CSV.write/CSV.RowWriter.
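As a quick sketch of these entry points (the tiny inline data string and the output file name are made up for illustration):
using CSV, DataFrames

data = """
a,b,c
1,2,3
4,5,6
"""

file = CSV.File(IOBuffer(data))            # parse; access columns via file.a, file[:a], file["a"]
df = CSV.read(IOBuffer(data), DataFrame)   # parse directly into a DataFrame sink
for row in CSV.Rows(IOBuffer(data))        # stream one row at a time (values are strings by default)
    @show row.a
end
CSV.write("out.csv", df)                   # write any Tables.jl table back out as csv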
The format for this section will go through the various inputs/options supported by CSV.File/CSV.read, with notes about compatibility with the other reading functionality (CSV.Rows, CSV.Chunks, etc.).
A required argument for reading. Input data should be ASCII or UTF-8 encoded text; for other text encodings, use the StringEncodings.jl package to convert to UTF-8.
Any delimited input is ultimately converted to a byte buffer (Vector{UInt8}) for parsing/processing, so with that in mind, let's look at the various supported input types:
File name as a String or FilePath; parsing will call Mmap.mmap(string(file)) to get a byte buffer to the file data. For gzip compressed inputs, like file.gz, the CodecZlib.jl package will be used to decompress the data to a temporary file first, then mmapped to a byte buffer. Decompression can also be done in memory by passing buffer_in_memory=true. Note that only gzip-compressed data is automatically decompressed; for other forms of compressed data, seek out the appropriate package to decompress and pass an IO or Vector{UInt8} of decompressed data as input.
Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}}: if you already have a byte buffer from wherever, you can just pass it in directly. If you have a csv-formatted string, you can pass it like CSV.File(IOBuffer(str))
IO or Cmd: you can pass an IO or Cmd directly, which will be consumed into a temporary file, then mmapped as a byte vector; to avoid a temp file and instead buffer data in memory, pass buffer_in_memory=true.
For files from the web, you can call HTTP.get(url).body to request the file, then access the data as a Vector{UInt8} from the body field, which can be passed directly for parsing. For Julia 1.6+, you can also use the Downloads stdlib, like Downloads.download(url), which can be passed directly to parsing.
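A brief sketch of these input forms (the file name and URL are placeholders):
using CSV, Downloads

CSV.File("data.csv")                                           # file name on disk (mmapped)
CSV.File(IOBuffer("a,b\n1,2\n"))                               # csv-formatted string wrapped in an IOBuffer
CSV.File(Vector{UInt8}(codeunits("a,b\n1,2\n")))               # raw byte buffer
CSV.File(Downloads.download("https://example.com/data.csv"))   # Julia 1.6+: download, then parse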
The header keyword argument controls how column names are treated when processing files. By default, it is assumed that the column names are the first row/line of the input, i.e. header=1. Alternative valid arguments for header include:
Integer, e.g. header=2: provide the row number as an Integer where the column names can be found
Bool, e.g. header=false: no column names exist in the data; column names will be auto-generated depending on the # of columns, like Column1, Column2, etc.
Vector{String} or Vector{Symbol}: manually provide column names as strings or symbols; should match the # of columns in the data. A copy of the Vector will be made and converted to Vector{Symbol}
AbstractVector{<:Integer}: in rare cases, there may be multi-row headers; by passing a collection of row numbers, each row will be parsed and the values for each row will be concatenated to form the final column names
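For example (inline data made up for illustration):
using CSV

data = """
col1,col2
1,2
3,4
"""

CSV.File(IOBuffer(data))                               # default header=1: names taken from the first row
CSV.File(IOBuffer(data); header=false, skipto=2)       # no names in data; auto-generate Column1, Column2
CSV.File(IOBuffer(data); header=["a", "b"], skipto=2)  # provide names manually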
Controls whether column names will be "normalized" to valid Julia identifiers. By default, this is false. If normalizenames=true, then column names with spaces, or that start with numbers, will be adjusted with underscores to become valid Julia identifiers. This is useful when you want to access columns via dot-access or getproperty, like file.col1. The identifier that comes after the . must be valid, so spaces or identifiers starting with numbers aren't allowed.
An Integer can be provided that specifies the row number where the data is located. By default, the row immediately following the header row is assumed to be the start of data. If header=false, or column names are provided manually as Vector{String} or Vector{Symbol}, the data is assumed to start on row 1, i.e. skipto=1.
An Integer argument specifying the number of rows to ignore at the end of a file. This works by the parser starting at the end of the file and parsing in reverse until footerskip # of rows have been parsed, then parsing the entire file, stopping at the newly adjusted "end of file".
If transpose=true is passed, data will be read "transposed", so each row will be parsed as a column, and each column in the data will be returned as a row. Useful when data is extremely wide (many columns), but you want to process it in a "long" format (many rows). Note that multithreaded parsing is not supported when parsing is transposed.
A String argument that, when encountered at the start of a row while parsing, will cause the row to be skipped. When providing header, skipto, or footerskip arguments, it should be noted that commented rows, while ignored, still count as "rows" when skipping to a specific row. In this way, you can visually identify, for example, that column names are on row 6, and pass header=6, even if row 5 is a commented row and will be ignored.
This argument specifies whether "empty rows", where consecutive newlines are parsed, should be ignored or not. By default, they are. If ignoreemptyrows=false, then for an empty row, all existing columns will have missing assigned to their value for that row. Similar to commented rows, empty rows also still count as "rows" when any of the header, skipto, or footerskip arguments are provided.
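A small sketch combining comment and ignoreemptyrows (data made up for illustration):
using CSV

data = """
# produced by an upstream export
a,b
1,2

3,4
"""

CSV.File(IOBuffer(data); comment="#")                          # commented row skipped; empty row ignored by default
CSV.File(IOBuffer(data); comment="#", ignoreemptyrows=false)   # keep the empty row as a row of missing values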
Arguments that control which columns from the input data will actually be parsed and available after processing. select controls which columns will be accessible after parsing while drop controls which columns to ignore. Either argument can be provided as a vector of Integer, String, or Symbol, specifying the column numbers or names to include/exclude. A vector of Bool matching the number of columns in the input data can also be provided, where each element specifies whether the corresponding column should be included/excluded. Finally, these arguments can also be given as boolean functions, of the form (i, name) -> Bool, where each column number and name will be given as arguments and the result of the function will determine if the column will be included/excluded.
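For example:
using CSV

data = "a,b,c\n1,2,3\n4,5,6\n"

CSV.File(IOBuffer(data); select=[:a, :c])              # keep only columns a and c
CSV.File(IOBuffer(data); select=[1, 3])                # same selection by column number
CSV.File(IOBuffer(data); drop=["b"])                   # or drop column b by name
CSV.File(IOBuffer(data); select=(i, name) -> i != 2)   # selector function form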
An Integer argument to specify the number of rows that should be read from the data. Can be used in conjunction with skipto to read contiguous chunks of a file. Note that with multithreaded parsing (when the data is deemed large enough), it can be difficult for parsing to determine the exact # of rows to limit to, so it may or may not return exactly limit number of rows. To ensure an exact limit on larger files, also pass ntasks=1 to force single-threaded parsing.
For large enough data inputs, ntasks controls the number of multithreaded tasks used to concurrently parse the data. By default, it uses Threads.nthreads(), which is the number of threads the julia process was started with, either via julia -t N or the JULIA_NUM_THREADS environment variable. To avoid multithreaded parsing, even on large files, pass ntasks=1. This argument is only applicable to CSV.File, not CSV.Rows. For CSV.Chunks, it controls the total number of chunk iterations a large file will be split up into for parsing.
When input data is large enough, parsing will attempt to "chunk" up the data for multithreaded tasks to parse concurrently. To chunk up the data, it is split up into even chunks, then initial parsers attempt to identify the correct start of the first row of that chunk. Once the start of the chunk's first row is found, each parser will check rows_to_check number of rows to ensure the expected number of columns are present.
NOTE: only applicable to vector of inputs passed to CSV.File
A Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually file name) of the input from whence the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
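A sketch, assuming two hypothetical input files with identical schemas:
using CSV

files = ["jan.csv", "feb.csv"]                              # hypothetical inputs
CSV.File(files; source=:month)                              # adds a :month column holding each input's file name
CSV.File(files; source=:month => ["January", "February"])   # or supply the per-input values explicitly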
Argument to control how missing values are handled while parsing input data. The default is missingstring="", which means two consecutive delimiters, like ,,, will result in a cell being set as a missing value. Otherwise, you can pass a single string to use as a "sentinel", like missingstring="NA", or a vector of strings, where each will be checked for when parsing, like missingstring=["NA", "NAN", "NULL"], and if any match, the cell will be set to missing. By passing missingstring=nothing, no missing values will be checked for while parsing.
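For example:
using CSV

data = "a,b\n1,NA\n2,\n"

CSV.File(IOBuffer(data))                               # default: empty field ("") parsed as missing
CSV.File(IOBuffer(data); missingstring="NA")           # treat "NA" as the missing sentinel
CSV.File(IOBuffer(data); missingstring=["NA", "NULL"]) # check several sentinels
CSV.File(IOBuffer(data); missingstring=nothing)        # don't check for missing values at all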
A Char or String argument that parsing looks for in the data input that separates distinct columns on each row. If no argument is provided (the default), parsing will try to detect the most consistent delimiter on the first 10 rows of the input, falling back to a single comma (,) if no other delimiter can be detected consistently.
A Bool argument, default false, that, if set to true, will cause parsing to ignore any number of consecutive delimiters between columns. This option can often be used to accurately parse fixed-width data inputs, where columns are delimited with a fixed number of delimiters, or a row is fixed-width and columns may have a variable number of delimiters between them based on the length of cell values.
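For example, ignorerepeated can handle whitespace-aligned, fixed-width-style input:
using CSV

CSV.File(IOBuffer("a;b\n1;2\n"); delim=';')                               # explicit non-comma delimiter
CSV.File(IOBuffer("a  b  c\n1  2  3\n"); delim=' ', ignorerepeated=true)  # collapse runs of spaces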
A Bool argument that controls whether parsing will check for opening/closing quote characters at the start/end of cells. Default true. If you happen to know a file has no quoted cells, it can simplify parsing to pass quoted=false, so parsing avoids treating the quotechar or openquotechar/closequotechar arguments specially.
An ASCII Char argument (or arguments if both openquotechar and closequotechar are provided) that parsing uses to handle "quoted" cells. If a cell string value contains the delim argument, or a newline, it should start and end with quotechar, or start with openquotechar and end with closequotechar so parsing knows to treat the delim or newline as part of the cell value instead of as significant parsing characters. If the quotechar or closequotechar characters also need to appear in the cell value, they should be properly escaped via the escapechar argument.
An ASCII Char argument that parsing uses when parsing quoted cells and the quotechar or closequotechar characters appear in a cell string value. If the escapechar character is encountered inside a quoted cell, it will be "skipped", and the following character will not be checked for parsing significance, but just treated as another character in the value of the cell. Note the escapechar is not included in the value of the cell, but is ignored completely.
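For example, with the default quotechar='"' and escapechar='"', a doubled "" inside a quoted field is a literal quote character:
using CSV

data = """
name,remark
alice,"hello, world"
bob,"she said ""hi"" loudly"
"""

file = CSV.File(IOBuffer(data))
file.remark   # the comma and the embedded quotes are part of the cell values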
A String or AbstractDict argument that controls how parsing detects datetime values in the data input. As a single String (or DateFormat) argument, the same format will be applied to all columns in the file. For columns without type information provided otherwise, parsing will use the provided format string to check if the cell is parseable and if so, will attempt to parse the entire column as the datetime type (Time, Date, or DateTime). By default, if no dateformat argument is explicitly provided, parsing will try to detect any of Time, Date, or DateTime types following the standard Dates.ISOTimeFormat, Dates.ISODateFormat, or Dates.ISODateTimeFormat formats, respectively. If a datetime type is provided for a column, (see the types argument), then the dateformat format string needs to match the format of values in that column, otherwise, a warning will be emitted and the value will be replaced with a missing value (this behavior is also configurable via the strict and silencewarnings arguments). If an AbstractDict is provided, different dateformat strings can be provided for specific columns; the provided dict can map either an Integer for column number or a String, Symbol or Regex for column name to the dateformat string that should be used for that column. Columns not mapped in the dict argument will use the default format strings mentioned above.
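For example (the format strings are standard Dates.jl format codes):
using CSV, Dates

data = """
timestamp,value
2021/08/13 14:30,1.2
2021/08/14 09:15,3.4
"""

CSV.File(IOBuffer(data); dateformat="yyyy/mm/dd HH:MM")                      # one format for the whole file
CSV.File(IOBuffer(data); dateformat=Dict("timestamp" => "yyyy/mm/dd HH:MM")) # per-column formats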
An ASCII Char argument that is used when parsing float values that indicates where the fractional portion of the float value begins. e.g. for the truncated value of pi, 3.14, the '.' character separates the 3 and 14, whereas for 3,14 (common European notation), the ',' character separates the fractional portion. By default, decimal='.'.
A "groupmark" is a symbol that separates groups of digits so that it easier for humans to read a number. Thousands separators are a common example of groupmarks. The argument groupmark, if provided, must be an ASCII Char which will be ignored during parsing when it occurs between two digits on the left hand side of the decimal. e.g the groupmark in the integer 1,729 is ',' and the groupmark for the US social security number 875-39-3196 is -. By default, groupmark=nothing which indicates that there are no stray characters separating digits.
These arguments can be provided as Vector{String} to specify custom values that should be treated as the Bool true/false values for all the columns of a data input. By default, ["true", "True", "TRUE", "T", "1"] string values are used to detect true values, and ["false", "False", "FALSE", "F", "0"] string values are used to detect false values. Note that even though "1" and "0" can be used to parse true/false values, in terms of auto detecting column types, those values will be parsed as Int64 first, instead of Bool. To instead parse those values as Bools for a column, you can manually provide that column's type as Bool (see the types argument).
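For example, to parse a Y/N column as Bool:
using CSV

data = "flag\nY\nN\nY\n"
CSV.File(IOBuffer(data); truestrings=["Y"], falsestrings=["N"])   # column detected as Bool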
Argument to control the types of columns that get parsed in the data input. Can be provided as a single Type, an AbstractVector of types, an AbstractDict, or a function.
If a single type is provided, like types=Float64, then all columns in the data input will be parsed as Float64. If a column's value isn't a valid Float64 value, then a warning will be emitted, unless silencewarnings=true is passed, in which case no warning will be printed. However, if strict=true is passed, then an error will be thrown instead, regardless of the silencewarnings argument.
If an AbstractVector{Type} is provided, then the length of the vector should match the number of columns in the data input, and each element gives the type of the corresponding column in order.
If an AbstractDict, then specific columns can have their column type specified with the key of the dict being an Integer for column number, or String or Symbol for column name or Regex matching column names, and the dict value being the column type. Unspecified columns will have their column type auto-detected while parsing.
If a function, then it should be of the form (i, name) -> Union{T, Nothing}, and will be applied to each detected column during initial parsing. Returning nothing from the function will result in the column's type being automatically detected during parsing.
By default types=nothing, which means all column types in the data input will be detected while parsing. Note that it isn't necessary to pass types=Union{Float64, Missing} if the data input contains missing values. Parsing will detect missing values if present, and promote any manually provided column types from the singular (Float64) to the missing equivalent (Union{Float64, Missing}) automatically. Standard types will be auto-detected in the following order when not otherwise specified: Int64, Float64, Date, DateTime, Time, Bool, String.
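A few examples of the forms described above (inline data made up for illustration):
using CSV, Dates

data = """
id,score,when
1,3.5,2021-01-01
2,4.0,2021-01-02
"""

CSV.File(IOBuffer(data); types=Dict(:id => Int32))       # one column by Symbol name
CSV.File(IOBuffer(data); types=Dict("when" => Date))     # or by String name
CSV.File(IOBuffer(data); types=[Int64, Float64, Date])   # one entry per column, in order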
Non-standard types can be provided, like Dec64 from the DecFP.jl package, but must support the Base.tryparse(T, str) function for parsing a value from a string. This allows, for example, easily defining a custom type, like struct Float64Array; values::Vector{Float64}; end, as long as a corresponding Base.tryparse definition is defined, like Base.tryparse(::Type{Float64Array}, str) = Float64Array(map(x -> parse(Float64, x), split(str, ';'))), where a single cell in the data input is like 1.23;4.56;7.89.
Note that the default stringtype can be overridden by providing a column's type manually, like CSV.File(source; types=Dict(1 => String), stringtype=PosLenString), where the first column will be parsed as a String, while any other string columns will have the PosLenString type.
An AbstractDict{Type, Type} argument that allows replacing a non-String standard type with another type when a column's type is auto-detected. Most commonly, this would be used to force all numeric columns to be Float64, like typemap=IdDict(Int64 => Float64), which would cause any columns detected as Int64 to be parsed as Float64 instead. Another common case would be wanting all columns of a specific type to be parsed as strings instead, like typemap=IdDict(Date => String), which will cause any columns detected as Date to be parsed as String instead.
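For example:
using CSV

data = "a,b\n1,2.5\n3,4.5\n"
CSV.File(IOBuffer(data); typemap=IdDict(Int64 => Float64))   # any auto-detected Int64 column becomes Float64
CSV.File(IOBuffer(data); typemap=IdDict(Float64 => String))  # auto-detected Float64 columns parsed as String instead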
Argument that controls whether columns will be returned as PooledArrays. Can be provided as a Bool, Float64, Tuple{Float64, Int}, vector, dict, or a function of the form (i, name) -> Union{Bool, Real, Tuple{Float64, Int}, Nothing}. As a Bool, controls absolutely whether a column will be pooled or not; if passed as a single Bool argument like pool=true, then all string columns will be pooled, regardless of cardinality. When passed as a Float64, the value should be between 0.0 and 1.0 to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if pool=0.1, then all string columns with a unique value % less than 10% will be returned as PooledArray, while other string columns will be normal string vectors. If pool is provided as a tuple, like (0.2, 500), the first tuple element is the same as a single Float64 value, which represents the % cardinality allowed. The second tuple element is an upper limit on the # of unique values allowed to pool the column. So the example, pool=(0.2, 500) means if a String column has less than or equal to 500 unique values and the # of unique values is less than 20% of total # of values, it will be pooled, otherwise, it won't. As mentioned, when the pool argument is a single Bool, Real, or Tuple{Float64, Int}, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a Bool, Float64, or Tuple{Float64, Int}. Similar to the types argument, providing a vector to pool should have an element for each column in the data input, while a dict argument can map column number/name to Bool, Float64, or Tuple{Float64, Int} for specific columns. Unspecified columns will not be pooled when the argument is a dict.
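For example:
using CSV

data = "id,category\n1,apple\n2,banana\n3,apple\n"

CSV.File(IOBuffer(data); pool=true)                       # pool every string column
CSV.File(IOBuffer(data); pool=0.5)                        # pool string columns with < 50% unique values
CSV.File(IOBuffer(data); pool=(0.5, 500))                 # ...and at most 500 unique values
CSV.File(IOBuffer(data); pool=Dict(:category => true))    # pool only the named column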
A Bool argument that controls whether Integer detected column types will be "shrunk" to the smallest possible integer type. Argument is false by default. Only applies to auto-detected column types; i.e. if a column type is provided manually as Int64, it will not be shrunk. Useful for shrinking the overall memory footprint of parsed data, though care should be taken when processing the results, as Julia by default has integer overflow behavior, which is increasingly likely the smaller the integer type.
An argument that controls the precise type of string columns. Supported values are InlineString (the default), PosLenString, or String. The various string types are aimed at being mostly transparent to most users. In certain workflows, however, it can be advantageous to be more specific. Here's a quick rundown of the possible options:
InlineString: a set of fixed-width, stack-allocated primitive types. Can take memory pressure off the GC because they aren't reference types/on the heap. For very large files with string columns that have a fairly low variance in string length, this can provide much better GC interaction than String. When string length has a high variance, it can lead to lots of "wasted space", since an entire column will be promoted to the smallest InlineString type that fits the longest string value. For small strings, that can mean a lot of wasted space when they're promoted to a high fixed-width.
PosLenString: results in columns returned as PosLenStringVector (or ChainedVector{PosLenStringVector} for the multithreaded case), which holds a reference to the original input data, and acts as one large "view" vector into the original data where each cell begins/ends. Can result in the smallest memory footprint for string columns. PosLenStringVector, however, does not support traditional mutable operations like regular Vectors, like push!, append!, or deleteat!.
String: each string must be heap-allocated, which can result in higher GC pressure in very large files. But columns are returned as normal Vector{String} (or ChainedVector{Vector{String}}), which can be processed normally, including any mutating operations.
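For example, to get plain String columns instead of the InlineString default:
using CSV

CSV.File(IOBuffer("a,b\nfoo,bar\n"); stringtype=String)   # columns returned as Vector{String}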
Arguments that control error behavior when invalid values are encountered while parsing. Only applicable when types are provided manually by the user via the types argument. If a column type is manually provided, but an invalid value is encountered, the default behavior is to set the value for that cell to missing, emit a warning (i.e. silencewarnings=false and strict=false), but only up to 100 total warnings and then they'll be silenced (i.e. maxwarnings=100). If strict=true, then invalid values will result in an error being thrown instead of any warnings emitted.
A Bool argument that controls the printing of extra "debug" information while parsing. Can be useful if parsing doesn't produce the expected result or a bug is suspected in parsing somehow.
Reads and parses a delimited file or files, materializing directly using the sink function. Allows avoiding excessive copies of columns for certain sinks like DataFrame.
Example
julia> using CSV, DataFrames
julia> path = tempname();
     │ String1  String1  String1
─────┼───────────────────────────
   1 │ a        b        c
   2 │ 1        2        3
Arguments
File layout options:
header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser
transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher to ensure parsing correctly finds these rows
source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually file name) of the input from which the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
Parsing options:
missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
escapechar='"': the Char used to escape quote characters in a quoted field
dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousand separators (1,000.00).
truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument
stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names
Column Type Options:
types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, e.g. Dict(1=>Float64) will set the first column to Float64, and Dict(:column1=>Float64) or Dict("column1"=>Float64) will set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, e.g. IdDict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (e.g. with pool=0.25, a column will be pooled if fewer than 25% of its values are unique). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column. (A short sketch using a few of these column-type options follows after this argument list.)
downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
validate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
Iteration options:
reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option when doing collect(CSV.Rows(file)) because only the currently iterated row is "valid")
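A minimal sketch using a few of the column-type options above with CSV.read (the column names and data are hypothetical):

using CSV, DataFrames

data = """
id,amount,when
1,3.14,2021-01-01
2,2.72,2021-01-02
"""

# parse `id` as Int8, re-map any detected Float64 column to String,
# pool string columns whose proportion of unique values is below 50%,
# and give the Date format explicitly
df = CSV.read(IOBuffer(data), DataFrame;
    types=Dict(:id => Int8),
    typemap=IdDict(Float64 => String),
    pool=0.5,
    dateformat="yyyy-mm-dd")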
Read a UTF-8 CSV input and return a CSV.File object, which is like a lightweight table/dataframe, allowing dot-access to columns and iteration over rows. Satisfies the Tables.jl interface, so it can be passed to any valid sink; to avoid unnecessary copies of data when the CSV.File intermediate object isn't needed, use CSV.read(input, sink; kwargs...) instead.
a Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer
a CodeUnits object, which wraps a String, like codeunits(str)
a csv-formatted string can also be passed like IOBuffer(str)
a Cmd or other IO
a gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing
a Vector of any of the above, which will parse and vertically concatenate each source, returning a single, "long" CSV.File
To read a csv file from a url, use the Downloads.jl stdlib or HTTP.jl package, where the resulting downloaded tempfile or HTTP.Response body can be passed like:
using Downloads, CSV
+ 2 │ 1 2 3
Arguments
File layout options:
header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser
transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher to ensure parsing correctly finds these rows
source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually file name) of the input from which the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
Parsing options:
missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
escapechar='"': the Char used to escape quote characters in a quoted field
dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousand separators (1,000.00).
truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument
stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names
Column Type Options:
types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, e.g. Dict(1=>Float64) will set the first column to Float64, and Dict(:column1=>Float64) or Dict("column1"=>Float64) will set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, e.g. IdDict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (e.g. with pool=0.25, a column will be pooled if fewer than 25% of its values are unique). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
validate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
Iteration options:
reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option when doing collect(CSV.Rows(file)) because only the currently iterated row is "valid")
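A short sketch exercising several of the parsing options above on made-up semicolon-delimited data:

using CSV

data = """
name;flag;value
alice;Y;1,5
bob;N;NA
"""

# ';' as the column delimiter, ',' as the decimal separator, "NA" treated as missing,
# and custom strings for detecting Bool values in the `flag` column
file = CSV.File(IOBuffer(data);
    delim=';',
    decimal=',',
    missingstring="NA",
    truestrings=["Y"],
    falsestrings=["N"])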
Read a UTF-8 CSV input and return a CSV.File object, which is like a lightweight table/dataframe, allowing dot-access to columns and iteration over rows. Satisfies the Tables.jl interface, so it can be passed to any valid sink; to avoid unnecessary copies of data when the CSV.File intermediate object isn't needed, use CSV.read(input, sink; kwargs...) instead.
a Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer
a CodeUnits object, which wraps a String, like codeunits(str)
a csv-formatted string can also be passed like IOBuffer(str)
a Cmd or other IO
a gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing
a Vector of any of the above, which will parse and vertically concatenate each source, returning a single, "long" CSV.File
To read a csv file from a url, use the Downloads.jl stdlib or HTTP.jl package, where the resulting downloaded tempfile or HTTP.Response body can be passed like:
using Downloads, CSV
f = CSV.File(Downloads.download(url))
# or
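Alternatively, a sketch using the HTTP.jl package mentioned above (url here is a placeholder string; the same pattern is shown for CSV.Rows further below):

using HTTP, CSV

url = "https://example.com/data.csv"  # placeholder url
f = CSV.File(HTTP.get(url).body)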
@@ -34,6 +34,6 @@
# load a csv file directly into an sqlite database table
db = SQLite.DB()
-tbl = CSV.File(file) |> SQLite.load!(db, "sqlite_table")
Arguments
File layout options:
header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser
transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher to ensure parsing correctly finds these rows
source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually file name) of the input from which the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
Parsing options:
missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
escapechar='"': the Char used to escape quote characters in a quoted field
dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousand separators (1,000.00).
truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument
stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names
Column Type Options:
types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, e.g. Dict(1=>Float64) will set the first column to Float64, and Dict(:column1=>Float64) or Dict("column1"=>Float64) will set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, e.g. IdDict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (e.g. with pool=0.25, a column will be pooled if fewer than 25% of its values are unique). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
validate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
Iteration options:
reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option when doing collect(CSV.Rows(file)) because only the currently iterated row is "valid")
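A small sketch of the file layout options above (the data is fabricated; note how commented rows interact with header detection):

using CSV

data = """
# produced by some export tool
colA,colB
1,2
3,4
total,ignored
"""

# rows starting with '#' are skipped, so the header becomes the first non-commented row,
# and footerskip=1 drops the trailing "total" row
file = CSV.File(IOBuffer(data); comment="#", footerskip=1)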
Returns a file "chunk" iterator. Accepts all the same inputs and keyword arguments as CSV.File, see those docs for explanations of each keyword argument.
The ntasks keyword argument specifies how many chunks a file should be split up into, defaulting to the # of threads available to Julia (i.e. JULIA_NUM_THREADS environment variable) or 8 if Julia is run single-threaded.
Each iteration of CSV.Chunks produces the next chunk of a file as a CSV.File. While initial file metadata detection is done only once (to determine # of columns, column names, etc), each iteration does independent type inference on columns. This is significant as different chunks may end up with different column types than previous chunks as new values are encountered in the file. Note that, as with CSV.File, types may be passed manually via the type or types keyword arguments.
This functionality is new and thus considered experimental; please open an issue if you run into any problems/bugs.
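A hedged sketch of iterating chunks (the file path is hypothetical):

using CSV

# process a large file in 4 chunks rather than materializing it all at once
for chunk in CSV.Chunks("large_file.csv"; ntasks=4)
    # each chunk is a CSV.File; column types may differ between chunks unless
    # types are provided explicitly
    println(length(chunk), " rows in this chunk")
end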
Arguments
File layout options:
header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser
transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher to ensure parsing correctly finds these rows
source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually file name) of the input from which the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
Parsing options:
missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
escapechar='"': the Char used to escape quote characters in a quoted field
dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousand separators (1,000.00).
truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument
stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names
Column Type Options:
types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, e.g. Dict(1=>Float64) will set the first column to Float64, and Dict(:column1=>Float64) or Dict("column1"=>Float64) will set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, e.g. IdDict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (e.g. with pool=0.25, a column will be pooled if fewer than 25% of its values are unique). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
validate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
Iteration options:
reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option when doing collect(CSV.Rows(file)) because only the currently iterated row is "valid")
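As a quick sketch of the select/drop/limit arguments listed above (shown with CSV.File; they behave the same for CSV.Chunks):

using CSV

data = """
a,b,c
1,2,3
4,5,6
7,8,9
"""

# keep only columns a and c, and stop after the first 2 data rows
file = CSV.File(IOBuffer(data); select=[:a, :c], limit=2)

# equivalently, drop column b using an (index, name) -> Bool function
file2 = CSV.File(IOBuffer(data); drop=(i, name) -> String(name) == "b", limit=2)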
a Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer
a CodeUnits object, which wraps a String, like codeunits(str)
a csv-formatted string can also be passed like IOBuffer(str)
a Cmd or other IO
a gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing
To read a csv file from a url, use the HTTP.jl package, where the HTTP.Response body can be passed like:
f = CSV.Rows(HTTP.get(url).body)
For other IO or Cmd inputs, you can pass them like: f = CSV.Rows(read(obj)).
While similar to CSV.File, CSV.Rows provides a slightly different interface, with the following tradeoffs:
Very minimal memory footprint; while iterating, only the current row values are buffered
Only provides row access via iteration; to access columns, one can stream the rows into a table type
Performs no type inference; each column/cell is essentially treated as Union{String, Missing}, users can utilize the performant Parsers.parse(T, str) to convert values to a more specific type if needed, or pass types upon construction using the type or types keyword arguments
Opens the file and uses passed arguments to detect the number of columns, but not column types (column types default to String unless otherwise manually provided). The returned CSV.Rows object supports the Tables.jl interface and can iterate rows. Each row object supports propertynames, getproperty, and getindex to access individual row values. Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:
for row in CSV.Rows(file)
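For example, a minimal self-contained version of that loop (the data and the println body are illustrative):

using CSV

data = """
a,b,c
1,2,3
4,5,6
"""

for row in CSV.Rows(IOBuffer(data))
    # each value is a string (or missing) since CSV.Rows performs no type inference
    println("a=$(row.a), b=$(row.b), c=$(row.c)")
end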
+tbl = CSV.File(file) |> SQLite.load!(db, "sqlite_table")
Arguments
File layout options:
header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser
transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher to ensure parsing correctly finds these rows
source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually file name) of the input from which the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
Parsing options:
missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
escapechar='"': the Char used to escape quote characters in a quoted field
dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousands separators (1,000.00). A short sketch using several of these parsing options follows this group.
truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument
stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names
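A short sketch combining several of the parsing options above; the data, formats, and column names are invented for illustration:

using CSV

data = """
date;price;qty;active
2021/01/02;3.50;1,000;yes
2021/01/03;NA;2,500;no
"""

f = CSV.File(IOBuffer(data);
    delim=';',                                 # semicolon-delimited columns
    missingstring=["NA"],                      # "NA" cells become missing
    dateformat="yyyy/mm/dd",                   # parse the date column as Date
    groupmark=',',                             # "1,000" parses as 1000
    truestrings=["yes"], falsestrings=["no"],
    types=Dict(:active => Bool))               # explicitly request Bool for the yes/no column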
Column Type Options:
types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. IdDict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (e.g. with pool=0.25, a column will be pooled if fewer than 25% of its values are unique). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column. A short sketch using several of these column-type options follows this group.
downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
validate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
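A short sketch of the column-type options above, on a small invented dataset:

using CSV

data = """
id,score,ratio,grade
1,3.5,0.25,A
2,4.0,0.75,B
3,2.5,0.5,A
"""

f = CSV.File(IOBuffer(data);
    types=Dict(:score => Float32),      # force a specific type for one column
    typemap=IdDict(Float64 => String),  # columns detected as Float64 (here ratio) are parsed as String instead
    pool=Dict(:grade => true),          # always pool the grade column, regardless of cardinality
    downcast=true,                      # detected Int64 columns (here id) stored in the smallest fitting type
    stringtype=String,                  # plain String columns instead of InlineString
    validate=false)                     # don't error if types/pool name a column that isn't present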
Iteration options:
reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only the currently iterated row is valid)
Returns a file "chunk" iterator. Accepts all the same inputs and keyword arguments as CSV.File, see those docs for explanations of each keyword argument.
The ntasks keyword argument specifies how many chunks a file should be split up into, defaulting to the # of threads available to Julia (i.e. JULIA_NUM_THREADS environment variable) or 8 if Julia is run single-threaded.
Each iteration of CSV.Chunks produces the next chunk of a file as a CSV.File. While initial file metadata detection is done only once (to determine # of columns, column names, etc), each iteration does independent type inference on columns. This is significant as different chunks may end up with different column types than previous chunks as new values are encountered in the file. Note that, as with CSV.File, types may be passed manually via the type or types keyword arguments.
This functionality is new and thus considered experimental; please open an issue if you run into any problems/bugs.
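A minimal sketch of iterating CSV.Chunks; the file name is hypothetical:

using CSV

# process a large file chunk-by-chunk without materializing the whole table;
# each chunk behaves like a CSV.File and can be sent to any Tables.jl sink
function count_rows(path)
    n = 0
    for chunk in CSV.Chunks(path; ntasks=8)
        n += length(chunk)   # each chunk is a CSV.File; length gives its row count
    end
    return n
end

# count_rows("very_large_file.csv")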
Arguments
File layout options:
header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser
transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows
source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values of the column will be the input "name" (usually file name) of the input from whence the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
Parsing options:
missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
escapechar='"': the Char used to escape quote characters in a quoted field
dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousands separators (1,000.00).
truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument
stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names
Column Type Options:
types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. IdDict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (e.g. with pool=0.25, a column will be pooled if fewer than 25% of its values are unique). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
validate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
Iteration options:
reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only the currently iterated row is valid)
a Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer
a CodeUnits object, which wraps a String, like codeunits(str)
a csv-formatted string can also be passed like IOBuffer(str)
a Cmd or other IO
a gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing
To read a csv file from a url, use the HTTP.jl package, where the HTTP.Response body can be passed like:
f = CSV.Rows(HTTP.get(url).body)
For other IO or Cmd inputs, you can pass them like: f = CSV.Rows(read(obj)).
While similar to CSV.File, CSV.Rows provides a slightly different interface; the tradeoffs include:
Very minimal memory footprint; while iterating, only the current row values are buffered
Only provides row access via iteration; to access columns, one can stream the rows into a table type
Performs no type inference; each column/cell is essentially treated as Union{String, Missing}; users can utilize the performant Parsers.parse(T, str) to convert values to a more specific type if needed (a short sketch follows the iteration example below), or pass types upon construction using the type or types keyword arguments
Opens the file and uses passed arguments to detect the number of columns, but not column types (column types default to String unless otherwise manually provided). The returned CSV.Rows object supports the Tables.jl interface and can iterate rows. Each row object supports propertynames, getproperty, and getindex to access individual row values. Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:
for row in CSV.Rows(file)
    println("a=$(row.a), b=$(row.b), c=$(row.c)")
end
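Since CSV.Rows performs no type inference, values come back as strings (or missing); a small sketch of converting them with Parsers.parse as mentioned above (Parsers is a dependency of CSV, but must be available in your environment to load directly):

using CSV, Parsers

data = """
a,b
1,2.5
3,4.5
"""

for row in CSV.Rows(IOBuffer(data))
    a = Parsers.parse(Int, row.a)        # convert the string cell to an Int
    b = Parsers.parse(Float64, row.b)    # convert the string cell to a Float64
    println(a + b)
end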
Arguments
File layout options:
header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser
transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows
source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values of the column will be the input "name" (usually file name) of the input from whence the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
Parsing options:
missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
escapechar='"': the Char used to escape quote characters in a quoted field
dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousands separators (1,000.00).
truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument
stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names
Column Type Options:
types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. IdDict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (e.g. with pool=0.25, a column will be pooled if fewer than 25% of its values are unique). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
validate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
Iteration options:
reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only the currently iterated row is valid); see the short sketch following this list
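A small sketch of the reusebuffer option above; it is safe here because each row is fully processed before moving on to the next:

using CSV

data = """
a,b,c
1,2,3
4,5,6
"""

# each iteration reuses one internal buffer, minimizing allocations;
# do not collect or otherwise hold on to the row objects when using this option
for row in CSV.Rows(IOBuffer(data); reusebuffer=true)
    println("a=$(row.a), b=$(row.b), c=$(row.c)")
end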
Use the same logic used by CSV.File to detect column types, to parse a value from a plain string. This can be useful in conjunction with the CSV.Rows type, which returns each cell of a file as a String. The order of types attempted is: Int, Float64, Date, DateTime, Bool, and if all fail, the input String is returned. No errors are thrown. For advanced usage, you can pass your own Parsers.Options type as a keyword argument option=ops for sentinel value detection.
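For illustration, a few values and what the detection order described above would likely return:

using CSV

CSV.detect("101")          # 101 (Int)
CSV.detect("3.14")         # 3.14 (Float64)
CSV.detect("2021-06-01")   # Date(2021, 6, 1), assuming the default date format
CSV.detect("true")         # true (Bool)
CSV.detect("hello")        # "hello" (no more specific type detected)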
The types that are detected by default when column types are not otherwise provided by the user. They include: Int64, Float64, Date, DateTime, Time, Bool, and String.
For all parsing functionality, newlines are detected/parsed automatically, regardless if they're present in the data as a single newline character ('\n'), single return character ('\r'), or full CRLF sequence ("\r\n").
Refers to the ratio of unique values to total number of values in a column. Columns with "low cardinality" have a low % of unique values, or put another way, there are only a few unique values for the entire column of data where unique values are repeated many times. Columns with "high cardinality" have a high % of unique values relative to total number of values. Think of these as "id-like" columns where each or almost each value is a unique identifier with no (or few) repeated values.
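A small sketch of the idea, with invented data: computing a column's unique-value ratio and how it interacts with the pool threshold:

using CSV

data = """
id,state
1,OH
2,OH
3,NY
4,OH
"""

f = CSV.File(IOBuffer(data))
cardinality = length(unique(f.state)) / length(f.state)   # 2 unique out of 4 values = 0.5

# under the default pool=(0.2, 500) the state column would not be pooled (0.5 > 0.2),
# but a higher threshold allows it:
f2 = CSV.File(IOBuffer(data); pool=0.6)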
diff --git a/dev/search_index.js b/dev/search_index.js
index 3d583cde..28a83278 100644
--- a/dev/search_index.js
+++ b/dev/search_index.js
@@ -1,3 +1,3 @@
var documenterSearchIndex = {"docs":
-[{"location":"examples.html#Examples","page":"Examples","title":"Examples","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"Pages = [\"examples.md\"]","category":"page"},{"location":"examples.html#stringencodings","page":"Examples","title":"Non-UTF-8 character encodings","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"# assume I have csv text data encoded in ISO-8859-1 encoding\n# I load the StringEncodings package, which provides encoding conversion functionality\nusing CSV, StringEncodings\n\n# I open my `iso8859_encoded_file.csv` with the `enc\"ISO-8859-1\"` encoding\n# and pass the opened IO object to `CSV.File`, which will read the entire\n# input into a temporary file, then parse the data from the temp file\nfile = CSV.File(open(\"iso8859_encoded_file.csv\", enc\"ISO-8859-1\"))\n\n# to instead have the encoding conversion happen in memory, pass\n# `buffer_in_memory=true`; this can be faster, but obviously results\n# in more memory being used rather than disk via a temp file\nfile = CSV.File(open(\"iso8859_encoded_file.csv\", enc\"ISO-8859-1\"); buffer_in_memory=true)","category":"page"},{"location":"examples.html#vectorinputs","page":"Examples","title":"Concatenate multiple inputs at once","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this case, I have a vector of delimited data inputs that each have\n# matching schema (the same column names and types). I'd like to process all\n# of the inputs together and vertically concatenate them into one \"long\" table.\ndata = [\n \"a,b,c\\n1,2,3\\n4,5,6\\n\",\n \"a,b,c\\n7,8,9\\n10,11,12\\n\",\n \"a,b,c\\n13,14,15\\n16,17,18\",\n]\n\n# I can just pass a `Vector` of inputs, in this case `IOBuffer(::String)`, but it\n# could also be a `Vector` of any valid input source, like `AbstractVector{UInt8}`,\n# filenames, `IO`, etc. Each input will be processed on a separate thread, with the results\n# being vertically concatenated afterwards as a single `CSV.File`. Each thread's columns\n# will be lazily concatenated using the `ChainedVector` type. 
As always, if we want to\n# send the parsed columns directly to a sink function, we can use `CSV.read`, like\n# `df = CSV.read(map(IOBuffer, data), DataFrame)`.\nf = CSV.File(map(IOBuffer, data))","category":"page"},{"location":"examples.html#gzipped_input","page":"Examples","title":"Gzipped input","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"# assume I have csv text data compressed via gzip\n# no additional packages are needed; CSV.jl can decompress automatically\nusing CSV\n\n# pass name of gzipped input file directly; data will be decompressed to a\n# temporary file, then mmapped as a byte buffer for actual parsing\nfile = CSV.File(\"data.gz\")\n\n# to instead have the decompression happen in memory, pass\n# `buffer_in_memory=true`; this can be faster, but obviously results\n# in more memory being used rather than disk via a temp file\nfile = CSV.File(\"data.gz\"; buffer_in_memory=true)","category":"page"},{"location":"examples.html#csv_string","page":"Examples","title":"Delimited data in a string","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# I have csv data in a string I want to parse\ndata = \"\"\"\na,b,c\n1,2,3\n4,5,6\n\"\"\"\n\n# Calling `IOBuffer` on a string returns an in-memory IO object\n# of the string data, which can be passed to `CSV.File` for parsing\nfile = CSV.File(IOBuffer(data))","category":"page"},{"location":"examples.html#http","page":"Examples","title":"Data from the web/a url","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"# assume there's delimited data I want to read from the web\n# one option is to use the HTTP.jl package\nusing CSV, HTTP\n\n# I first make the web request to get the data via `HTTP.get` on the `url`\nhttp_response = HTTP.get(url)\n\n# I can then access the data of the response as a `Vector{UInt8}` and pass\n# it directly to `CSV.File` for parsing\nfile = CSV.File(http_response.body)\n\n# another option, with Julia 1.6+, is using the Downloads stdlib\nusing Downloads\nhttp_response = Downloads.download(url)\n\n# by default, `Downloads.download` writes the response data to a temporary file\n# which can then be passed to `CSV.File` for parsing\nfile = CSV.File(http_response)","category":"page"},{"location":"examples.html#zip_example","page":"Examples","title":"Reading from a zip file","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using ZipFile, CSV, DataFrames\n\na = DataFrame(a = 1:3)\nCSV.write(\"a.csv\", a)\n\n# zip the file; Windows users who do not have zip available on the PATH can manually zip the CSV\n# or write directly into the zip archive as shown below\n;zip a.zip a.csv\n\n# alternatively, write directly into the zip archive (without creating an unzipped csv file first)\nz = ZipFile.Writer(\"a2.zip\")\nf = ZipFile.addfile(z, \"a.csv\", method=ZipFile.Deflate)\na |> CSV.write(f)\nclose(z)\n\n# read file from zip archive\nz = ZipFile.Reader(\"a.zip\") # or \"a2.zip\"\n\n# identify the right file in zip\na_file_in_zip = filter(x->x.name == \"a.csv\", z.files)[1]\n\na_copy = CSV.File(a_file_in_zip) |> DataFrame\n\na == a_copy\n\nclose(z)","category":"page"},{"location":"examples.html#second_row_header","page":"Examples","title":"Column names on 2nd row","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\ndata = \"\"\"\ndescriptive 
row with information about the file that we'd like to ignore\na,b,c\n1,2,3\n4,5,6\n\"\"\"\n\n# by passing header=2, parsing will ignore the 1st row entirely\n# then parse the column names on row 2, then by default, it assumes\n# the data starts on the row after the column names (row 3 in this case)\n# which is correct for this case\nfile = CSV.File(IOBuffer(data); header=2)","category":"page"},{"location":"examples.html#no_header","page":"Examples","title":"No column names in data","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this case, our data doesn't have any column names\ndata = \"\"\"\n1,2,3\n4,5,6\n\"\"\"\n\n# by passing `header=false`, parsing won't worry about looking for column names\n# anywhere, but instead just start parsing the data and generate column names\n# as needed, like `Column1`, `Column2`, and `Column3` in this case\nfile = CSV.File(IOBuffer(data); header=false)","category":"page"},{"location":"examples.html#manual_header","page":"Examples","title":"Manually provide column names","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this case, our data doesn't have any column names\ndata = \"\"\"\n1,2,3\n4,5,6\n\"\"\"\n\n# instead of passing `header=false` and getting auto-generated column names,\n# we can instead pass the column names ourselves\nfile = CSV.File(IOBuffer(data); header=[\"a\", \"b\", \"c\"])\n\n# we can also pass the column names as Symbols; a copy of the manually provided\n# column names will always be made and then converted to `Vector{Symbol}`\nfile = CSV.File(IOBuffer(data); header=[:a, :b, :c])","category":"page"},{"location":"examples.html#multi_row_header","page":"Examples","title":"Multi-row column names","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this case, our column names are `col_a`, `col_b`, and `col_c`,\n# but split over the first and second rows\ndata = \"\"\"\ncol,col,col\na,b,c\n1,2,3\n4,5,6\n\"\"\"\n\n# by passing a collection of integers, parsing will parse each row in the collection\n# and concatenate the values for each column, separating rows with `_` character\nfile = CSV.File(IOBuffer(data); header=[1, 2])","category":"page"},{"location":"examples.html#normalize_header","page":"Examples","title":"Normalizing column names","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this case, our data are single letters, with column names of \"1\", \"2\", and \"3\"\n# A single digit isn't a valid identifier in Julia, meaning we couldn't do something\n# like `1 = 2 + 2`, where `1` would be a variable name\ndata = \"\"\"\n1,2,3\na,b,c\nd,e,f\nh,i,j\n\"\"\"\n\n# in order to have valid identifiers for column names, we can pass\n# `normalizenames=true`, which result in our column names becoming \"_1\", \"_2\", and \"_3\"\n# note this isn't required, but can be convenient in certain cases\nfile = CSV.File(IOBuffer(data); normalizenames=true)\n\n# we can acces the first column like\nfile._1\n\n# another example where we may want to normalize is column names with spaces in them\ndata = \"\"\"\ncolumn one,column two, column three\n1,2,3\n4,5,6\n\"\"\"\n\n# normalizing will result in column names like \"column_one\", \"column_two\" and \"column_three\"\nfile = CSV.File(IOBuffer(data); 
normalizenames=true)","category":"page"},{"location":"examples.html#skipto_example","page":"Examples","title":"Skip to specific row where data starts","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# our data has a first row that we'd like to ignore; our data also doesn't have\n# column names, so we'd like them to be auto-generated\ndata = \"\"\"\ndescriptive row that gives information about the data that we'd like to ignore\n1,2,3\n4,5,6\n\"\"\"\n\n# with no column names in the data, we first pass `header=false`; by itself,\n# this would result in parsing starting on row 1 to parse the actual data;\n# but we'd like to ignore the first row, so we pass `skipto=2` to skip over\n# the first row; our colum names will be generated like `Column1`, `Column2`, `Column3`\nfile = CSV.File(IOBuffer(data); header=false, skipto=2)","category":"page"},{"location":"examples.html#footerskip_example","page":"Examples","title":"Skipping trailing useless rows","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# our data has column names of \"a\", \"b\", and \"c\"\n# but at the end of the data, we have 2 rows we'd like to ignore while parsing\n# since they're not properly delimited\ndata = \"\"\"\na,b,c\n1,2,3\n4,5,6\n7,8,9\ntotals: 12, 15, 18\ngrand total: 45\n\"\"\"\n\n# by passing `footerskip=2`, we tell parsing to start the end of the data and\n# read 2 rows, ignoring their contents, then mark the ending position where\n# the normal parsing process should finish\nfile = CSV.File(IOBuffer(data); footerskip=2)","category":"page"},{"location":"examples.html#transpose_example","page":"Examples","title":"Reading transposed data","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# our data is transposed, meaning our column names are in the first column,\n# with the data for column \"a\" all on the first row, data for column \"b\"\n# all on the second row, and so on.\ndata = \"\"\"\na,1,4,7\nb,2,5,8\nc,3,6,9\n\"\"\"\n\n# by passing `transpose=true`, parsing will look for column names in the first\n# column of data, then parse each row as a separate column\nfile = CSV.File(IOBuffer(data); transpose=true)","category":"page"},{"location":"examples.html#comment_example","page":"Examples","title":"Ignoring commented rows","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# here, we have several non-data rows that all begin with the \"#\" string\ndata = \"\"\"\n# row describing column names\na,b,c\n# row describing first row of data\n1,2,3\n# row describing second row of data\n4,5,6\n\"\"\"\n\n# we want to ignore these \"commented\" rows\nfile = CSV.File(IOBuffer(data); comment=\"#\")","category":"page"},{"location":"examples.html#ignoreemptyrows_example","page":"Examples","title":"Ignoring empty rows","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# here, we have a \"gap\" row in between the first and second row of data\n# by default, these \"empty\" rows are ignored, but in our case, this is\n# how a row of data is input when all columns have missing/null values\n# so we don't want those rows to be ignored so we can know how many\n# missing cases there are in our data\ndata = \"\"\"\na,b,c\n1,2,3\n\n4,5,6\n\"\"\"\n\n# by passing `ignoreemptyrows=false`, we ensure parsing 
treats an empty row\n# as each column having a `missing` value set for that row\nfile = CSV.File(IOBuffer(data); ignoreemptyrows=false)","category":"page"},{"location":"examples.html#select_example","page":"Examples","title":"Including/excluding columns","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# simple dataset, but we know column \"b\" isn't needed\n# so we'd like to save time by having parsing ignore it completely\ndata = \"\"\"\na,b,c\n1,2,3\n4,5,6\n7,8,9\n\"\"\"\n\n# there are quite a few ways to provide the select/drop arguments\n# so we provide an example of each, first for selecting the columns\n# \"a\" and \"c\" that we want to include or keep from parsing\nfile = CSV.File(IOBuffer(data); select=[1, 3])\nfile = CSV.File(IOBuffer(data); select=[:a, :c])\nfile = CSV.File(IOBuffer(data); select=[\"a\", \"c\"])\nfile = CSV.File(IOBuffer(data); select=[true, false, true])\nfile = CSV.File(IOBuffer(data); select=(i, nm) -> i in (1, 3))\n# now examples of dropping, when we'd rather specify the column(s)\n# we'd like to drop/exclude from parsing\nfile = CSV.File(IOBuffer(data); drop=[2])\nfile = CSV.File(IOBuffer(data); drop=[:b])\nfile = CSV.File(IOBuffer(data); drop=[\"b\"])\nfile = CSV.File(IOBuffer(data); drop=[false, true, false])\nfile = CSV.File(IOBuffer(data); drop=(i, nm) -> i == 2)","category":"page"},{"location":"examples.html#limit_example","page":"Examples","title":"Limiting number of rows from data","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# here, we have quite a few rows of data (relative to other examples, lol)\n# but we know we only need the first 3 for the analysis we need to do\n# so instead of spending the time parsing the entire file, we'd like\n# to just read the first 3 rows and ignore the rest\ndata = \"\"\"\na,b,c\n1,2,3\n4,5,6\n7,8,9\n10,11,12\n13,14,15\n\"\"\"\n\n# parsing will start reading rows, and once 3 have been read, it will\n# terminate early, avoiding the parsing of the rest of the data entirely\nfile = CSV.File(IOBuffer(data); limit=3)","category":"page"},{"location":"examples.html#missing_string_example","page":"Examples","title":"Specifying custom missing strings","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this data, our first column has \"missing\" values coded with -999\n# but our score column has \"NA\" instead\n# we'd like either of those values to show up as `missing` after we parse the data\ndata = \"\"\"\ncode,age,score\n0,21,3.42\n1,42,6.55\n-999,81,NA\n-999,83,NA\n\"\"\"\n\n# by passing missingstring=[\"-999\", \"NA\"], parsing will check each cell if it matches\n# either string in order to set the value of the cell to `missing`\nfile = CSV.File(IOBuffer(data); missingstring=[\"-999\", \"NA\"])","category":"page"},{"location":"examples.html#string_delim","page":"Examples","title":"String delimiter","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# our data has two columns, separated by double colon\n# characters (\"::\")\ndata = \"\"\"\ncol1::col2\n1::2\n3::4\n\"\"\"\n\n# we can pass a single character or string for delim\nfile = CSV.File(IOBuffer(data); delim=\"::\")","category":"page"},{"location":"examples.html#ignorerepeated_example","page":"Examples","title":"Fixed width 
files","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# This is an example of \"fixed width\" data, where each\n# column is the same number of characters away from each\n# other on each row. Fields are \"padded\" with extra\n# delimiters (in this case `' '`) so that each column is\n# the same number of characters each time\ndata = \"\"\"\ncol1 col2 col3\n123431 2 3421\n2355 346 7543\n\"\"\"\n# In addition to our `delim`, we can pass\n# `ignorerepeated=true`, which tells parsing that\n#consecutive delimiters should be treated as a single\n# delimiter.\nfile = CSV.File(IOBuffer(data); delim=' ', ignorerepeated=true)","category":"page"},{"location":"examples.html#quoted_example","page":"Examples","title":"Turning off quoted cell parsing","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# by default, cells like the 1st column, 2nd row\n# will be treated as \"quoted\" cells, where they start\n# and end with the quote character '\"'. The quotes will\n# be removed from the final parsed value\n# we may, however, want the \"raw\" value and _not_ ignore\n# the quote characters in the final value\ndata = \"\"\"\na,b,c\n\"hey\",2,3\nthere,4,5\nsailor,6,7\n\"\"\"\n\n# we can \"turn off\" the detection of quoted cells\n# by passing `quoted=false`\nfile = CSV.File(IOBuffer(data); quoted=false)","category":"page"},{"location":"examples.html#quotechar_example","page":"Examples","title":"Quoted & escaped fields","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In this data, we have a few \"quoted\" fields, which means the field's value starts and ends with `quotechar` (or\n# `openquotechar` and `closequotechar`, respectively). Quoted fields allow the field to contain characters that would otherwise\n# be significant to parsing, such as delimiters or newline characters. When quoted, parsing will ignore these otherwise\n# signficant characters until the closing quote character is found. For quoted fields that need to also include the quote\n# character itself, an escape character is provided to tell parsing to ignore the next character when looking for a close quote\n# character. In the syntax examples, the keyword arguments are passed explicitly, but these also happen to be the default\n# values, so just doing `CSV.File(IOBuffer(data))` would result in successful parsing.\ndata = \"\"\"\ncol1,col2\n\"quoted field with a delimiter , inside\",\"quoted field that contains a \\\\n newline and \"\"inner quotes\\\"\\\"\\\"\nunquoted field,unquoted field with \"inner quotes\"\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); quotechar='\"', escapechar='\"')\n\nfile = CSV.File(IOBuffer(data); openquotechar='\"' closequotechar='\"', escapechar='\"')","category":"page"},{"location":"examples.html#dateformat_example","page":"Examples","title":"DateFormat","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In this file, our `date` column has dates that are formatted like `yyyy/mm/dd`. We can pass just such a string to the\n# `dateformat` keyword argument to tell parsing to use it when looking for `Date` or `DateTime` columns. 
Note that currently,\n# only a single `dateformat` string can be passed to parsing, meaning multiple columns with different date formats cannot all\n# be parsed as `Date`/`DateTime`.\ndata = \"\"\"\ncode,date\n0,2019/01/01\n1,2019/01/02\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); dateformat=\"yyyy/mm/dd\")","category":"page"},{"location":"examples.html#decimal_example","page":"Examples","title":"Custom decimal separator","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In many places in the world, floating point number decimals are separated with a comma instead of a period (`3,14` vs. `3.14`)\n# . We can correctly parse these numbers by passing in the `decimal=','` keyword argument. Note that we probably need to\n# explicitly pass `delim=';'` in this case, since the parser will probably think that it detected `','` as the delimiter.\ndata = \"\"\"\ncol1;col2;col3\n1,01;2,02;3,03\n4,04;5,05;6,06\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); delim=';', decimal=',')","category":"page"},{"location":"examples.html#thousands_example","page":"Examples","title":"Thousands separator","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In many places in the world, digits to the left of the decimal place are broken into\n# groups by a thousands separator. We can ignore those separators by passing the `groupmark`\n# keyword argument.\ndata = \"\"\"\nx y\n1 2\n2 1,729\n3 87,539,319\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); groupmark=',')","category":"page"},{"location":"examples.html#groupmark_example","page":"Examples","title":"Custom groupmarks","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In some contexts, separators other than thousands separators group digits in a number.\n# `groupmark` supports ignoring them as long as the separator character is ASCII\ndata = \"\"\"\nname;ssn;credit card number\nAyodele Beren;597-21-8366;5538-6111-0574-2633\nTrinidad Shiori;387-35-5126;3017-9300-0776-5301\nOri Cherokee;731-12-4606;4682-5416-0636-3877\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); groupmark='-')","category":"page"},{"location":"examples.html#truestrings_example","page":"Examples","title":"Custom bool strings","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# By default, parsing only considers the string values `true` and `false` as valid `Bool` values. To consider alternative\n# values, we can pass a `Vector{String}` to the `truestrings` and `falsestrings` keyword arguments.\ndata = \"\"\"\nid,paid,attended\n0,T,TRUE\n1,F,TRUE\n2,T,FALSE\n3,F,FALSE\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); truestrings=[\"T\", \"TRUE\"], falsestrings=[\"F\", \"FALSE\"])","category":"page"},{"location":"examples.html#matrix_example","page":"Examples","title":"Matrix-like Data","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# This file contains a 3x3 identity matrix of `Float64`. 
By default, parsing will detect the delimiter and type, but we can\n# also explicitly pass `delim= ' '` and `types=Float64`, which tells parsing to explicitly treat each column as `Float64`,\n# without having to guess the type on its own.\ndata = \"\"\"\n1.0 0.0 0.0\n0.0 1.0 0.0\n0.0 0.0 1.0\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); header=false)\nfile = CSV.File(IOBuffer(data); header=false, delim=' ', types=Float64)\n\n# as a last step if you want to convert this to a Matrix, this can be done by reading in first as a DataFrame and then\n# function chaining to a Matrix\nusing DataFrames\nA = file|>DataFrame|>Matrix\n\n# another alternative is to simply use CSV.Tables.matrix and say\nB = file|>CSV.Tables.matrix # does not require DataFrames","category":"page"},{"location":"examples.html#types_example","page":"Examples","title":"Providing types","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In this file, our 3rd column has an invalid value on the 2nd row `invalid`. Let's imagine we'd still like to treat it as an\n# `Int` column, and ignore the `invalid` value. The syntax examples provide several ways we can tell parsing to treat the 3rd\n# column as `Int`, by referring to column index `3`, or column name with `Symbol` or `String`. We can also provide an entire\n# `Vector` of types for each column (and which needs to match the length of columns in the file). There are two additional\n# keyword arguments that control parsing behavior; in the first 4 syntax examples, we would see a warning printed like\n# `\"warning: invalid Int64 value on row 2, column 3\"`. In the fifth example, passing `silencewarnings=true` will suppress this\n# warning printing. In the last syntax example, passing `strict=true` will result in an error being thrown during parsing.\ndata = \"\"\"\ncol1,col2,col3\n1,2,3\n4,5,invalid\n6,7,8\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); types=Dict(3 => Int))\nfile = CSV.File(IOBuffer(data); types=Dict(:col3 => Int))\nfile = CSV.File(IOBuffer(data); types=Dict(\"col3\" => Int))\nfile = CSV.File(IOBuffer(data); types=[Int, Int, Int])\nfile = CSV.File(IOBuffer(data); types=[Int, Int, Int], silencewarnings=true)\nfile = CSV.File(IOBuffer(data); types=[Int, Int, Int], strict=true)\n\n\n# In this file we have lots of columns, and would like to specify the same type for all\n# columns except one which should have a different type. We can do this by providing a\n# function that takes the column index and column name and uses these to decide the type.\ndata = \"\"\"\ncol1,col2,col3,col4,col5,col6,col7\n1,2,3,4,5,6,7\n0,2,3,4,5,6,7\n1,2,3,4,5,6,7\n\"\"\"\nfile = CSV.File(IOBuffer(data); types=(i, name) -> i == 1 ? Bool : Int8)\nfile = CSV.File(IOBuffer(data); types=(i, name) -> name == :col1 ? Bool : Int8)\n# Alternatively by providing the exact name for the first column and a Regex to match the rest.\n# Note that an exact column name always takes precedence over a regular expression.\nfile = CSV.File(IOBuffer(data); types=Dict(:col1 => Bool, r\"^col\\d\" => Int8))","category":"page"},{"location":"examples.html#typemap_example","page":"Examples","title":"Typemap","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In this file, we have U.S. zipcodes in the first column that we'd rather not treat as `Int`, but parsing will detect it as\n# such. 
In the first syntax example, we pass `typemap=IdDict(Int => String)`, which tells parsing to treat any detected `Int`\n# columns as `String` instead. In the second syntax example, we alternatively set the `zipcode` column type manually.\ndata = \"\"\"\nzipcode,score\n03494,9.9\n12345,6.7\n84044,3.4\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); typemap=IdDict(Int => String))\nfile = CSV.File(IOBuffer(data); types=Dict(:zipcode => String))","category":"page"},{"location":"examples.html#pool_example","page":"Examples","title":"Pooled values","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations\n# like joining and grouping when `String` values are \"pooled\", meaning each unique value is mapped to a `UInt32`. By default,\n# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide\n# greater control: `pool=0.4` means that if 40% or less of a column's values are unique, then it will be pooled.\ndata = \"\"\"\nid,code\nA18E9,AT\nBF392,GC\n93EBC,AT\n54EE1,AT\n8CD2E,GC\n\"\"\"\n\nfile = CSV.File(IOBuffer(data))\nfile = CSV.File(IOBuffer(data); pool=0.4)\nfile = CSV.File(IOBuffer(data); pool=0.6)","category":"page"},{"location":"examples.html#nonstring_pool_example","page":"Examples","title":"Non-string pooled values","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this data, our `category` column is an integer type, but represents a limited set of values that could benefit from\n# pooling. Indeed, we may want to do various DataFrame grouping/joining operations on the column, which can be more\n# efficient if the column type is a PooledVector. By default, passing `pool=true` will only pool string column types,\n# if we pass a vector or dict however, we can specify how specific, non-string type, columns should be pooled.\ndata = \"\"\"\ncategory,amount\n1,100.01\n1,101.10\n2,201.10\n2,202.40\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); pool=Dict(1 => true))\nfile = CSV.File(IOBuffer(data); pool=[true, false])","category":"page"},{"location":"examples.html#pool_absolute_threshold","page":"Examples","title":"Pool with absolute threshold","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations\n# like joining and grouping when `String` values are \"pooled\", meaning each unique value is mapped to a `UInt32`. By default,\n# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. 
Via the `pool` keyword argument, we can provide\n# greater control: `pool=(0.5, 2)` means that if a column has 2 or fewer unique values _and_ the total number of unique values is less than 50% of all values, then it will be pooled.\ndata = \"\"\"\nid,code\nA18E9,AT\nBF392,GC\n93EBC,AT\n54EE1,AT\n8CD2E,GC\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); pool=(0.5, 2))","category":"page"},{"location":"index.html#CSV.jl-Documentation","page":"Home","title":"CSV.jl Documentation","text":"","category":"section"},{"location":"index.html","page":"Home","title":"Home","text":"GitHub Repo: https://github.com/JuliaData/CSV.jl","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"Welcome to CSV.jl! A pure-Julia package for handling delimited text data, be it comma-delimited (csv), tab-delimited (tsv), or otherwise.","category":"page"},{"location":"index.html#Installation","page":"Home","title":"Installation","text":"","category":"section"},{"location":"index.html","page":"Home","title":"Home","text":"You can install CSV by typing the following in the Julia REPL:","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"] add CSV ","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"followed by ","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"using CSV","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"to load the package.","category":"page"},{"location":"index.html#Overview","page":"Home","title":"Overview","text":"","category":"section"},{"location":"index.html","page":"Home","title":"Home","text":"To start out, let's discuss the high-level functionality provided by the package, which hopefully will help direct you to more specific documentation for your use-case:","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"CSV.File: the most commonly used function for ingesting delimited data; will read an entire data input or vector of data inputs, detecting number of columns and rows, along with the type of data for each column. Returns a CSV.File object, which is like a lightweight table/DataFrame. Assuming file is a variable of a CSV.File object, individual columns can be accessed like file.col1, file[:col1], or file[\"col\"]. You can see parsed column names via file.names. A CSV.File can also be iterated, where a CSV.Row is produced on each iteration, which allows access to each value in the row via row.col1, row[:col1], or row[1]. You can also index a CSV.File directly, like file[1] to return the entire CSV.Row at the provided index/row number. Multiple threads will be used while parsing the input data if the input is large enough, and full return column buffers to hold the parsed data will be allocated. CSV.File satisfies the Tables.jl \"source\" interface, and so can be passed to valid sink functions like DataFrame, SQLite.load!, Arrow.write, etc. Supports a number of keyword arguments to control parsing, column type, and other file metadata options.\nCSV.read: a convenience function identical to CSV.File, but used when a CSV.File will be passed directly to a sink function, like a DataFrame. 
In some cases, sinks may make copies of incoming data for their own safety; by calling CSV.read(file, DataFrame), no copies of the parsed CSV.File will be made, and the DataFrame will take direct ownership of the CSV.File's columns, which is more efficient than doing CSV.File(file) |> DataFrame which will result in an extra copy of each column being made. Keyword arguments are identical to CSV.File. Any valid Tables.jl sink function/table type can be passed as the 2nd argument. Like CSV.File, a vector of data inputs can be passed as the 1st argument, which will result in a single \"long\" table of all the inputs vertically concatenated. Each input must have identical schemas (column names and types).\nCSV.Rows: an alternative approach for consuming delimited data, where the input is only consumed one row at a time, which allows \"streaming\" the data with a lower memory footprint than CSV.File. Supports many of the same options as CSV.File, except column type handling is a little different. By default, every column type will be essentially Union{Missing, String}, i.e. no automatic type detection is done, but column types can be provided manually. Multithreading is not used while parsing. After constructing a CSV.Rows object, rows can be \"streamed\" by iterating, where each iteration produces a CSV.Row2 object, which operates similar to CSV.File's CSV.Row type where individual row values can be accessed via row.col1, row[:col1], or row[1]. If each row is processed individually, additional memory can be saved by passing reusebuffer=true, which means a single buffer will be allocated to hold the values of only the currently iterated row. CSV.Rows also supports the Tables.jl interface and can also be passed to valid sink functions.\nCSV.Chunks: similar to CSV.File, but allows passing a ntasks::Integer keyword argument which will cause the input file to be \"chunked\" up into ntasks number of chunks. After constructing a CSV.Chunks object, each iteration of the object will return a CSV.File of the next parsed chunk. Useful for processing extremely large files in \"chunks\". Because each iterated element is a valid Tables.jl \"source\", CSV.Chunks satisfies the Tables.partitions interface, so sinks that can process input partitions can operate by passing CSV.Chunks as the \"source\".\nCSV.write: A valid Tables.jl \"sink\" function for writing any valid input table out in a delimited text format. Supports many options for controlling the output like delimiter, quote characters, etc. Writes data to an internal buffer, which is flushed out when full, buffer size is configurable. Also supports writing out partitioned inputs as separate output files, one file per input partition. To write out a DataFrame, for example, it's simply CSV.write(\"data.csv\", df), or to write out a matrix, it's using Tables; CSV.write(\"data.csv\", Tables.table(mat))\nCSV.RowWriter: An alternative way to produce csv output; takes any valid Tables.jl input, and on each iteration, produces a single csv-formatted string from the input table's row.","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"That's quite a bit! Let's boil down a TL;DR:","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"Just want to read a delimited file or collection of files and do basic stuff with data? Use CSV.File(file) or CSV.read(file, DataFrame)\nDon't need the data as a whole or want to stream through a large file row-by-row? 
Use CSV.Rows.\nWant to process a large file in \"batches\"/chunks? Use CSV.Chunks.\nNeed to produce a csv? Use CSV.write.\nWant to iterate an input table and produce a single csv string per row? CSV.RowWriter.","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"For the rest of the manual, we're going to have two big sections, Reading and Writing where we'll walk through the various options to CSV.File/CSV.read/CSV.Rows/CSV.Chunks and CSV.write/CSV.RowWriter.","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"Pages = [\"reading.md\", \"writing.md\", \"examples.md\"]","category":"page"},{"location":"reading.html#Reading","page":"Reading","title":"Reading","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"The format for this section will go through the various inputs/options supported by CSV.File/CSV.read, with notes about compatibility with the other reading functionality (CSV.Rows, CSV.Chunks, etc.).","category":"page"},{"location":"reading.html#input","page":"Reading","title":"input","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A required argument for reading. Input data should be ASCII or UTF-8 encoded text; for other text encodings, use the StringEncodings.jl package to convert to UTF-8.","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Any delimited input is ultimately converted to a byte buffer (Vector{UInt8}) for parsing/processing, so with that in mind, let's look at the various supported input types:","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"File name as a String or FilePath; parsing will call Mmap.mmap(string(file)) to get a byte buffer to the file data. For gzip compressed inputs, like file.gz, the CodecZlib.jl package will be used to decompress the data to a temporary file first, then mmapped to a byte buffer. Decompression can also be done in memory by passing buffer_in_memory=true. Note that only gzip-compressed data is automatically decompressed; for other forms of compressed data, seek out the appropriate package to decompress and pass an IO or Vector{UInt8} of decompressed data as input.\nVector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}}: if you already have a byte buffer from wherever, you can just pass it in directly. If you have a csv-formatted string, you can pass it like CSV.File(IOBuffer(str))\nIO or Cmd: you can pass an IO or Cmd directly, which will be consumed into a temporary file, then mmapped as a byte vector; to avoid a temp file and instead buffer data in memory, pass buffer_in_memory=true.\nFor files from the web, you can call HTTP.get(url).body to request the file, then access the data as a Vector{UInt8} from the body field, which can be passed directly for parsing. 
For Julia 1.6+, you can also use the Downloads stdlib, like Downloads.download(url) which can be passed to parsing","category":"page"},{"location":"reading.html#Examples","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"StringEncodings.jl example\nVector of inputs example\nGzip input\nDelimited data in a string\nData from the web\nData in zip archive","category":"page"},{"location":"reading.html#header","page":"Reading","title":"header","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"The header keyword argument controls how column names are treated when processing files. By default, it is assumed that the column names are the first row/line of the input, i.e. header=1. Alternative valid arguments for header include:","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Integer, e.g. header=2: provide the row number as an Integer where the column names can be found\nBool, e.g. header=false: no column names exist in the data; column names will be auto-generated depending on the # of columns, like Column1, Column2, etc.\nVector{String} or Vector{Symbol}: manually provide column names as strings or symbols; should match the # of columns in the data. A copy of the Vector will be made and converted to Vector{Symbol}\nAbstractVector{<:Integer}: in rare cases, there may be multi-row headers; by passing a collection of row numbers, each row will be parsed and the values for each row will be concatenated to form the final column names","category":"page"},{"location":"reading.html#Examples-2","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Column names on second row\nNo column names in the data\nManually provide column names\nMulti-row column names","category":"page"},{"location":"reading.html#normalizenames","page":"Reading","title":"normalizenames","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Controls whether column names will be \"normalized\" to valid Julia identifiers. By default, this is false. If normalizenames=true, then column names with spaces, or that start with numbers, will be adjusted with underscores to become valid Julia identifiers. This is useful when you want to access columns via dot-access or getproperty, like file.col1. The identifier that comes after the . must be valid, so spaces or identifiers starting with numbers aren't allowed.","category":"page"},{"location":"reading.html#Examples-3","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Normalizing column names","category":"page"},{"location":"reading.html#skipto","page":"Reading","title":"skipto","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An Integer can be provided that specifies the row number where the data is located. By default, the row immediately following the header row is assumed to be the start of data. If header=false, or column names are provided manually as Vector{String} or Vector{Symbol}, the data is assumed to start on row 1, i.e. 
skipto=1.","category":"page"},{"location":"reading.html#Examples-4","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Skip to specific row where data starts","category":"page"},{"location":"reading.html#footerskip","page":"Reading","title":"footerskip","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An Integer argument specifying the number of rows to ignore at the end of a file. This works by the parser starting at the end of the file and parsing in reverse until footerskip # of rows have been parsed, then parsing the entire file, stopping at the newly adjusted \"end of file\".","category":"page"},{"location":"reading.html#Examples-5","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Skipping trailing useless rows","category":"page"},{"location":"reading.html#transpose","page":"Reading","title":"transpose","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"If transpose=true is passed, data will be read \"transposed\", so each row will be parsed as a column, and each column in the data will be returned as a row. Useful when data is extremely wide (many columns), but you want to process it in a \"long\" format (many rows). Note that multithreaded parsing is not supported when parsing is transposed.","category":"page"},{"location":"reading.html#Examples-6","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Reading transposed data","category":"page"},{"location":"reading.html#comment","page":"Reading","title":"comment","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A String argument that, when encountered at the start of a row while parsing, will cause the row to be skipped. When providing header, skipto, or footerskip arguments, it should be noted that commented rows, while ignored, still count as \"rows\" when skipping to a specific row. In this way, you can visually identify, for example, that column names are on row 6, and pass header=6, even if row 5 is a commented row and will be ignored.","category":"page"},{"location":"reading.html#Examples-7","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Ignoring commented rows","category":"page"},{"location":"reading.html#ignoreemptyrows","page":"Reading","title":"ignoreemptyrows","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"This argument specifies whether \"empty rows\", where consecutive newlines are parsed, should be ignored or not. By default, they are. If ignoreemptyrows=false, then for an empty row, all existing columns will have missing assigned to their value for that row. 
Similar to commented rows, empty rows also still count as \"rows\" when any of the header, skipto, or footerskip arguments are provided.","category":"page"},{"location":"reading.html#Examples-8","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Ignoring empty rows","category":"page"},{"location":"reading.html#select","page":"Reading","title":"select / drop","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Arguments that control which columns from the input data will actually be parsed and available after processing. select controls which columns will be accessible after parsing while drop controls which columns to ignore. Either argument can be provided as a vector of Integer, String, or Symbol, specifying the column numbers or names to include/exclude. A vector of Bool matching the number of columns in the input data can also be provided, where each element specifies whether the corresponding column should be included/excluded. Finally, these arguments can also be given as boolean functions, of the form (i, name) -> Bool, where each column number and name will be given as arguments and the result of the function will determine if the column will be included/excluded.","category":"page"},{"location":"reading.html#Examples-9","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Including/excluding columns","category":"page"},{"location":"reading.html#limit","page":"Reading","title":"limit","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An Integer argument to specify the number of rows that should be read from the data. Can be used in conjunction with skipto to read contiguous chunks of a file. Note that with multithreaded parsing (when the data is deemed large enough), it can be difficult for parsing to determine the exact # of rows to limit to, so it may or may not return exactly limit number of rows. To ensure an exact limit on larger files, also pass ntasks=1 to force single-threaded parsing.","category":"page"},{"location":"reading.html#Examples-10","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Limiting number of rows from data","category":"page"},{"location":"reading.html#ntasks","page":"Reading","title":"ntasks","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"NOTE: not applicable to CSV.Rows","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"For large enough data inputs, ntasks controls the number of multithreaded tasks used to concurrently parse the data. By default, it uses Threads.nthreads(), which is the number of threads the julia process was started with, either via julia -t N or the JULIA_NUM_THREADS environment variable. To avoid multithreaded parsing, even on large files, pass ntasks=1. This argument is only applicable to CSV.File, not CSV.Rows. 
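As a quick illustration (a minimal sketch; the file name \"large.csv\" is a hypothetical placeholder, not from the package docs):\n\nusing CSV\n\n# force single-threaded parsing, even on a large input\nfile = CSV.File(\"large.csv\"; ntasks=1)\n\n# or explicitly allow up to 4 concurrent parsing tasks\nfile = CSV.File(\"large.csv\"; ntasks=4)\n\n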
For CSV.Chunks, it controls the total number of chunk iterations a large file will be split up into for parsing.","category":"page"},{"location":"reading.html#rows_to_check","page":"Reading","title":"rows_to_check","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"NOTE: not applicable to CSV.Rows","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"When input data is large enough, parsing will attempt to \"chunk\" up the data for multithreaded tasks to parse concurrently. To chunk up the data, it is split up into even chunks, then initial parsers attempt to identify the correct start of the first row of that chunk. Once the start of the chunk's first row is found, each parser will check rows_to_check number of rows to ensure the expected number of columns are present.","category":"page"},{"location":"reading.html#source","page":"Reading","title":"source","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"NOTE: only applicable to vector of inputs passed to CSV.File","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values of the column will be the input \"name\" (usually file name) of the input from whence the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that inputs values in the auto-added column.","category":"page"},{"location":"reading.html#missingstring","page":"Reading","title":"missingstring","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Argument to control how missing values are handled while parsing input data. The default is missingstring=\"\", which means two consecutive delimiters, like ,,, will result in a cell being set as a missing value. Otherwise, you can pass a single string to use as a \"sentinel\", like missingstring=\"NA\", or a vector of strings, where each will be checked for when parsing, like missingstring=[\"NA\", \"NAN\", \"NULL\"], and if any match, the cell will be set to missing. By passing missingstring=nothing, no missing values will be checked for while parsing.","category":"page"},{"location":"reading.html#Examples-11","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Specifying custom missing strings","category":"page"},{"location":"reading.html#delim","page":"Reading","title":"delim","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A Char or String argument that parsing looks for in the data input that separates distinct columns on each row. 
If no argument is provided (the default), parsing will try to detect the most consistent delimiter on the first 10 rows of the input, falling back to a single comma (,) if no other delimiter can be detected consistently.","category":"page"},{"location":"reading.html#Examples-12","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"String delimiter","category":"page"},{"location":"reading.html#ignorerepeated","page":"Reading","title":"ignorerepeated","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A Bool argument, default false, that, if set to true, will cause parsing to ignore any number of consecutive delimiters between columns. This option can often be used to accurately parse fixed-width data inputs, where columns are delimited with a fixed number of delimiters, or a row is fixed-width and columns may have a variable number of delimiters between them based on the length of cell values.","category":"page"},{"location":"reading.html#Examples-13","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Fixed width files","category":"page"},{"location":"reading.html#quoted","page":"Reading","title":"quoted","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A Bool argument that controls whether parsing will check for opening/closing quote characters at the start/end of cells. Default true. If you happen to know a file has no quoted cells, it can simplify parsing to pass quoted=false, so parsing avoids treating the quotechar or openquotechar/closequotechar arguments specially.","category":"page"},{"location":"reading.html#Examples-14","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Turning off quoted cell parsing","category":"page"},{"location":"reading.html#quotechar","page":"Reading","title":"quotechar / openquotechar / closequotechar","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An ASCII Char argument (or arguments if both openquotechar and closequotechar are provided) that parsing uses to handle \"quoted\" cells. If a cell string value contains the delim argument, or a newline, it should start and end with quotechar, or start with openquotechar and end with closequotechar so parsing knows to treat the delim or newline as part of the cell value instead of as significant parsing characters. If the quotechar or closequotechar characters also need to appear in the cell value, they should be properly escaped via the escapechar argument.","category":"page"},{"location":"reading.html#Examples-15","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Quoted & escaped fields","category":"page"},{"location":"reading.html#escapechar","page":"Reading","title":"escapechar","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An ASCII Char argument that parsing uses when parsing quoted cells and the quotechar or closequotechar characters appear in a cell string value. 
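As a brief, hedged sketch (made-up data; it relies only on the default quotechar='\"' and escapechar='\"' described here):\n\nusing CSV\n\ndata = \"\"\"\na,b\n\"she said \"\"hi\"\" to me\",2\n\"\"\"\n\n# with the default quotechar and escapechar both '\"', the doubled quotes inside the\n# quoted cell are unescaped, so the parsed value of column a, row 1 is: she said \"hi\" to me\nfile = CSV.File(IOBuffer(data))\n\n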
If the escapechar character is encountered inside a quoted cell, it will be \"skipped\", and the following character will not be checked for parsing significance, but just treated as another character in the value of the cell. Note the escapechar is not included in the value of the cell, but is ignored completely.","category":"page"},{"location":"reading.html#dateformat","page":"Reading","title":"dateformat","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A String or AbstractDict argument that controls how parsing detects datetime values in the data input. As a single String (or DateFormat) argument, the same format will be applied to all columns in the file. For columns without type information provided otherwise, parsing will use the provided format string to check if the cell is parseable and if so, will attempt to parse the entire column as the datetime type (Time, Date, or DateTime). By default, if no dateformat argument is explicitly provided, parsing will try to detect any of Time, Date, or DateTime types following the standard Dates.ISOTimeFormat, Dates.ISODateFormat, or Dates.ISODateTimeFormat formats, respectively. If a datetime type is provided for a column (see the types argument), then the dateformat format string needs to match the format of values in that column, otherwise, a warning will be emitted and the value will be replaced with a missing value (this behavior is also configurable via the strict and silencewarnings arguments). If an AbstractDict is provided, different dateformat strings can be provided for specific columns; the provided dict can map either an Integer for column number or a String, Symbol or Regex for column name to the dateformat string that should be used for that column. Columns not mapped in the dict argument will use the default format strings mentioned above.","category":"page"},{"location":"reading.html#Examples-16","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"DateFormat","category":"page"},{"location":"reading.html#decimal","page":"Reading","title":"decimal","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An ASCII Char argument that is used when parsing float values that indicates where the fractional portion of the float value begins. i.e. for the truncated value of pi, 3.14, the '.' character separates the 3 and 14 values, whereas for 3,14 (common European notation), the ',' character separates the fractional portion. By default, decimal='.'.","category":"page"},{"location":"reading.html#Examples-17","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Custom decimal separator","category":"page"},{"location":"reading.html#groupmark","page":"Reading","title":"groupmark / thousands separator","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A \"groupmark\" is a symbol that separates groups of digits so that it is easier for humans to read a number. Thousands separators are a common example of groupmarks. The argument groupmark, if provided, must be an ASCII Char which will be ignored during parsing when it occurs between two digits on the left hand side of the decimal. e.g. the groupmark in the integer 1,729 is ',' and the groupmark for the US social security number 875-39-3196 is -. 
By default, groupmark=nothing, which indicates that there are no stray characters separating digits.","category":"page"},{"location":"reading.html#Examples-18","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Thousands separator\nCustom groupmarks","category":"page"},{"location":"reading.html#truestrings","page":"Reading","title":"truestrings / falsestrings","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"These arguments can be provided as Vector{String} to specify custom values that should be treated as the Bool true/false values for all the columns of a data input. By default, [\"true\", \"True\", \"TRUE\", \"T\", \"1\"] string values are used to detect true values, and [\"false\", \"False\", \"FALSE\", \"F\", \"0\"] string values are used to detect false values. Note that even though \"1\" and \"0\" can be used to parse true/false values, in terms of auto detecting column types, those values will be parsed as Int64 first, instead of Bool. To instead parse those values as Bools for a column, you can manually provide that column's type as Bool (see the types argument).","category":"page"},{"location":"reading.html#Examples-19","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Custom bool strings","category":"page"},{"location":"reading.html#types","page":"Reading","title":"types","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Argument to control the types of columns that get parsed in the data input. Can be provided as a single Type, an AbstractVector of types, an AbstractDict, or a function.","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"If a single type is provided, like types=Float64, then all columns in the data input will be parsed as Float64. If a column's value isn't a valid Float64 value, then a warning will be emitted, unless silencewarnings=true is passed, in which case no warning will be printed. However, if strict=true is passed, then an error will be thrown instead, regardless of the silencewarnings argument.\nIf an AbstractVector{Type} is provided, then the length of the vector should match the number of columns in the data input, and each element gives the type of the corresponding column in order.\nIf an AbstractDict, then specific columns can have their column type specified with the key of the dict being an Integer for column number, or String or Symbol for column name or Regex matching column names, and the dict value being the column type. Unspecified columns will have their column type auto-detected while parsing.\nIf a function, then it should be of the form (i, name) -> Union{T, Nothing}, and will be applied to each detected column during initial parsing. Returning nothing from the function will result in the column's type being automatically detected during parsing.","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"By default types=nothing, which means all column types in the data input will be detected while parsing. Note that it isn't necessary to pass types=Union{Float64, Missing} if the data input contains missing values. Parsing will detect missing values if present, and promote any manually provided column types from the singular (Float64) to the missing equivalent (Union{Float64, Missing}) automatically. 
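For example (a minimal sketch with made-up data), a column requested as Float64 that contains an empty cell simply comes back as Union{Float64, Missing}:\n\nusing CSV\n\ndata = \"\"\"\na,b\n1.5,2\n,4\n\"\"\"\n\n# column :a is requested as Float64; the empty cell in the last row parses as missing,\n# so the column's element type is promoted to Union{Float64, Missing} automatically\nfile = CSV.File(IOBuffer(data); types=Dict(:a => Float64))\n\n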
Standard types will be auto-detected in the following order when not otherwise specified: Int64, Float64, Date, DateTime, Time, Bool, String.","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Non-standard types can be provided, like Dec64 from the DecFP.jl package, but must support the Base.tryparse(T, str) function for parsing a value from a string. This allows, for example, easily defining a custom type, like struct Float64Array; values::Vector{Float64}; end, as long as a corresponding Base.tryparse definition is defined, like Base.tryparse(::Type{Float64Array}, str) = Float64Array(map(x -> parse(Float64, x), split(str, ';'))), where a single cell in the data input is like 1.23;4.56;7.89.","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Note that the default stringtype can be overridden by providing a column's type manually, like CSV.File(source; types=Dict(1 => String), stringtype=PosLenString), where the first column will be parsed as a String, while any other string columns will have the PosLenString type.","category":"page"},{"location":"reading.html#Examples-20","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Matrix-like Data\nProviding types","category":"page"},{"location":"reading.html#typemap","page":"Reading","title":"typemap","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An AbstractDict{Type, Type} argument that allows replacing a non-String standard type with another type when a column's type is auto-detected. Most commonly, this would be used to force all numeric columns to be Float64, like typemap=IdDict(Int64 => Float64), which would cause any columns detected as Int64 to be parsed as Float64 instead. Another common case would be wanting all columns of a specific type to be parsed as strings instead, like typemap=IdDict(Date => String), which will cause any columns detected as Date to be parsed as String instead.","category":"page"},{"location":"reading.html#Examples-21","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Typemap","category":"page"},{"location":"reading.html#pool","page":"Reading","title":"pool","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Argument that controls whether columns will be returned as PooledArrays. Can be provided as a Bool, Float64, Tuple{Float64, Int}, vector, dict, or a function of the form (i, name) -> Union{Bool, Real, Tuple{Float64, Int}, Nothing}. As a Bool, controls absolutely whether a column will be pooled or not; if passed as a single Bool argument like pool=true, then all string columns will be pooled, regardless of cardinality. When passed as a Float64, the value should be between 0.0 and 1.0 to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if pool=0.1, then all string columns with a unique value % less than 10% will be returned as PooledArray, while other string columns will be normal string vectors. If pool is provided as a tuple, like (0.2, 500), the first tuple element is the same as a single Float64 value, which represents the % cardinality allowed. The second tuple element is an upper limit on the # of unique values allowed to pool the column. 
So the example pool=(0.2, 500) means if a String column has less than or equal to 500 unique values and the # of unique values is less than 20% of total # of values, it will be pooled; otherwise, it won't. As mentioned, when the pool argument is a single Bool, Real, or Tuple{Float64, Int}, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a Bool, Float64, or Tuple{Float64, Int}. Similar to the types argument, providing a vector to pool should have an element for each column in the data input, while a dict argument can map column number/name to Bool, Float64, or Tuple{Float64, Int} for specific columns. Unspecified columns will not be pooled when the argument is a dict.","category":"page"},{"location":"reading.html#Examples-22","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Pooled values\nNon-string column pooling\nPool with absolute threshold","category":"page"},{"location":"reading.html#downcast","page":"Reading","title":"downcast","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A Bool argument that controls whether Integer detected column types will be \"shrunk\" to the smallest possible integer type. Argument is false by default. Only applies to auto-detected column types; i.e. if a column type is provided manually as Int64, it will not be shrunk. Useful for shrinking the overall memory footprint of parsed data, though care should be taken when processing the results as Julia by default has integer overflow behavior, which is increasingly likely the smaller the integer type.","category":"page"},{"location":"reading.html#stringtype","page":"Reading","title":"stringtype","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An argument that controls the precise type of string columns. Supported values are InlineString (the default), PosLenString, or String. The various string types are aimed at being mostly transparent to most users. In certain workflows, however, it can be advantageous to be more specific. Here's a quick rundown of the possible options:","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"InlineString: a set of fixed-width, stack-allocated primitive types. Can take memory pressure off the GC because they aren't reference types/on the heap. For very large files with string columns that have a fairly low variance in string length, this can provide much better GC interaction than String. When string length has a high variance, it can lead to lots of \"wasted space\", since an entire column will be promoted to the smallest InlineString type that fits the longest string value. For small strings, that can mean a lot of wasted space when they're promoted to a high fixed-width.\nPosLenString: results in columns returned as PosLenStringVector (or ChainedVector{PosLenStringVector} for the multithreaded case), which holds a reference to the original input data, and acts as one large \"view\" vector into the original data where each cell begins/ends. Can result in the smallest memory footprint for string columns. PosLenStringVector, however, does not support traditional mutable operations like regular Vectors, like push!, append!, or deleteat!.\nString: each string must be heap-allocated, which can result in higher GC pressure in very large files. 
But columns are returned as normal Vector{String} (or ChainedVector{Vector{String}}), which can be processed normally, including any mutating operations.","category":"page"},{"location":"reading.html#strict","page":"Reading","title":"strict / silencewarnings / maxwarnings","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Arguments that control error behavior when invalid values are encountered while parsing. Only applicable when types are provided manually by the user via the types argument. If a column type is manually provided, but an invalid value is encountered, the default behavior is to set the value for that cell to missing, emit a warning (i.e. silencewarnings=false and strict=false), but only up to 100 total warnings and then they'll be silenced (i.e. maxwarnings=100). If strict=true, then invalid values will result in an error being thrown instead of any warnings emitted.","category":"page"},{"location":"reading.html#debug","page":"Reading","title":"debug","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A Bool argument that controls the printing of extra \"debug\" information while parsing. Can be useful if parsing doesn't produce the expected result or a bug is suspected in parsing somehow.","category":"page"},{"location":"reading.html#API-Reference","page":"Reading","title":"API Reference","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"CSV.read\nCSV.File\nCSV.Chunks\nCSV.Rows","category":"page"},{"location":"reading.html#CSV.read","page":"Reading","title":"CSV.read","text":"CSV.read(source, sink::T; kwargs...) => T\n\nRead and parses a delimited file or files, materializing directly using the sink function. Allows avoiding excessive copies of columns for certain sinks like DataFrame.\n\nExample\n\njulia> using CSV, DataFrames\n\njulia> path = tempname();\n\njulia> write(path, \"a,b,c\\n1,2,3\");\n\njulia> CSV.read(path, DataFrame)\n1×3 DataFrame\n Row │ a b c\n │ Int64 Int64 Int64\n─────┼─────────────────────\n 1 │ 1 2 3\n\njulia> CSV.read(path, DataFrame; header=false)\n2×3 DataFrame\n Row │ Column1 Column2 Column3\n │ String1 String1 String1\n─────┼───────────────────────────\n 1 │ a b c\n 2 │ 1 2 3\n\nArguments\n\nFile layout options:\n\nheader=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.\nnormalizenames::Bool=false: whether column names should be \"normalized\" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)\nskipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. 
If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.\nfooterskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser\ntranspose::Bool: read a csv file \"transposed\", i.e. each column is parsed as a row\ncomment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.\nignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)\nselect: an AbstractVector of Integer, Symbol, String, or Bool, or a \"selector\" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.\ndrop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a \"drop\" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will ignored in the resulting CSV.File. Invalid values in drop are ignored.\nlimit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use threaded=false to ensure an exact limit if necessary\nbuffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.\nntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)\nrows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, lines_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows\nsource: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values of the column will be the input \"name\" (usually file name) of the input from whence the value was parsed. 
As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that inputs values in the auto-added column.\n\nParsing options:\n\nmissingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring=\"\", which means only an empty field (two consecutive delimiters) is considered missing\ndelim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file\nignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells\nquoted::Bool=true: whether parsing should check for quotechar at the start/end of cells\nquotechar='\"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters\nescapechar='\"': the Char used to escape quote characters in a quoted field\ndateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.\ndecimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','\ngroupmark=nothing: optionally specify a single-byte character denoting the number grouping mark, this allows parsing of numbers that have, e.g., thousand separators (1,000.00).\ntruestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default \"true\", \"True\", \"TRUE\", \"T\", \"1\" are used to detect true and \"false\", \"False\", \"FALSE\", \"F\", \"0\" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument\nstripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names\n\nColumn Type Options:\n\ntypes: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64 and, Dict(\"column1\"=>Float64) will set the column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.\ntypemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. 
Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only \"standard\" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is \"detected\", it will be mapped to the specified type.\npool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.\ndowncast::Bool=false: controls whether columns detected as Int64 will be \"downcast\" to the smallest possible integer type like Int8, Int16, Int32, etc.\nstringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special \"lazy\" AbstractVector that acts as a \"view\" into the original file data. This can lead to the most efficient parsing times, but note that the \"view\" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail\nstrict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing\nsilencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced\nmaxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings\ndebug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed\nvalidate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. 
If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.\n\nIteration options:\n\nreusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only current iterated row is \"valid\")\n\n\n\n\n\n","category":"function"},{"location":"reading.html#CSV.File","page":"Reading","title":"CSV.File","text":"CSV.File(input; kwargs...) => CSV.File\n\nRead a UTF-8 CSV input and return a CSV.File object, which is like a lightweight table/dataframe, allowing dot-access to columns and iterating rows. Satisfies the Tables.jl interface, so can be passed to any valid sink, yet to avoid unnecessary copies of data, use CSV.read(input, sink; kwargs...) instead if the CSV.File intermediate object isn't needed.\n\nThe input argument can be one of:\n\nfilename given as a string or FilePaths.jl type\na Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer\na CodeUnits object, which wraps a String, like codeunits(str)\na csv-formatted string can also be passed like IOBuffer(str)\na Cmd or other IO\na gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing\na Vector of any of the above, which will parse and vertically concatenate each source, returning a single, \"long\" CSV.File\n\nTo read a csv file from a url, use the Downloads.jl stdlib or HTTP.jl package, where the resulting downloaded tempfile or HTTP.Response body can be passed like:\n\nusing Downloads, CSV\nf = CSV.File(Downloads.download(url))\n\n# or\n\nusing HTTP, CSV\nf = CSV.File(HTTP.get(url).body)\n\nOpens the file or files and uses passed arguments to detect the number of columns and column types, unless column types are provided manually via the types keyword argument. Note that passing column types manually can slightly increase performance for each column type provided (column types can be given as a Vector for all columns, or specified per column via name or index in a Dict).\n\nWhen a Vector of inputs is provided, the column names and types of each separate file/input must match to be vertically concatenated. Separate threads will be used to parse each input, which will each parse their input using just the single thread. The results of all threads are then vertically concatenated using ChainedVectors to lazily concatenate each thread's columns.\n\nFor text encodings other than UTF-8, load the StringEncodings.jl package and call e.g. CSV.File(open(read, input, enc\"ISO-8859-1\")).\n\nThe returned CSV.File object supports the Tables.jl interface and can iterate CSV.Rows. CSV.Row supports propertynames and getproperty to access individual row values. CSV.File also supports entire column access like a DataFrame via direct property access on the file object, like f = CSV.File(file); f.col1. Or by getindex access with column names, like f[:col1] or f[\"col1\"]. The returned columns are AbstractArray subtypes, including: SentinelVector (for integers), regular Vector, PooledVector for pooled columns, MissingVector for columns of all missing values, PosLenStringVector when stringtype=PosLenString is passed, and ChainedVector will chain one of the previous array types together for data inputs that use multiple threads to parse (each thread parses a single \"chain\" of the input). 
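To make the vector-of-inputs behavior concrete, here is a minimal sketch (the inputs and the "origin" column name are hypothetical, invented for illustration) of parsing two in-memory csv inputs with matching schemas into one vertically concatenated CSV.File, using the source keyword documented in the arguments below to label which input each row came from:

using CSV

# two in-memory inputs with identical column names and types
inputs = [IOBuffer("a,b\n1,2\n3,4\n"), IOBuffer("a,b\n5,6\n")]

# parse both inputs and vertically concatenate the results; the `source` Pair
# adds an extra "origin" column whose values label the input each row came from
f = CSV.File(inputs; source = "origin" => ["first", "second"])

f.a       # concatenated column: 1, 3, 5
f.origin  # "first", "first", "second"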
Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:\n\nfor row in CSV.File(file)\n println(\"a=$(row.a), b=$(row.b), c=$(row.c)\")\nend\n\nBy supporting the Tables.jl interface, a CSV.File can also be a table input to any other table sink function. Like:\n\n# materialize a csv file as a DataFrame, copying columns from CSV.File\ndf = CSV.File(file) |> DataFrame\n\n# to avoid making a copy of parsed columns, use CSV.read\ndf = CSV.read(file, DataFrame)\n\n# load a csv file directly into an sqlite database table\ndb = SQLite.DB()\ntbl = CSV.File(file) |> SQLite.load!(db, \"sqlite_table\")\n\nArguments\n\nFile layout options:\n\nheader=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.\nnormalizenames::Bool=false: whether column names should be \"normalized\" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)\nskipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.\nfooterskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser\ntranspose::Bool: read a csv file \"transposed\", i.e. each column is parsed as a row\ncomment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.\nignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)\nselect: an AbstractVector of Integer, Symbol, String, or Bool, or a \"selector\" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. 
Invalid values in select are ignored.\ndrop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a \"drop\" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will ignored in the resulting CSV.File. Invalid values in drop are ignored.\nlimit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use threaded=false to ensure an exact limit if necessary\nbuffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.\nntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)\nrows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, lines_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows\nsource: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values of the column will be the input \"name\" (usually file name) of the input from whence the value was parsed. 
As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that inputs values in the auto-added column.\n\nParsing options:\n\nmissingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring=\"\", which means only an empty field (two consecutive delimiters) is considered missing\ndelim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file\nignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells\nquoted::Bool=true: whether parsing should check for quotechar at the start/end of cells\nquotechar='\"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters\nescapechar='\"': the Char used to escape quote characters in a quoted field\ndateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.\ndecimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','\ngroupmark=nothing: optionally specify a single-byte character denoting the number grouping mark, this allows parsing of numbers that have, e.g., thousand separators (1,000.00).\ntruestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default \"true\", \"True\", \"TRUE\", \"T\", \"1\" are used to detect true and \"false\", \"False\", \"FALSE\", \"F\", \"0\" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument\nstripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names\n\nColumn Type Options:\n\ntypes: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64 and, Dict(\"column1\"=>Float64) will set the column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.\ntypemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. 
Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only \"standard\" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is \"detected\", it will be mapped to the specified type.\npool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.\ndowncast::Bool=false: controls whether columns detected as Int64 will be \"downcast\" to the smallest possible integer type like Int8, Int16, Int32, etc.\nstringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special \"lazy\" AbstractVector that acts as a \"view\" into the original file data. This can lead to the most efficient parsing times, but note that the \"view\" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail\nstrict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing\nsilencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced\nmaxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings\ndebug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed\nvalidate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. 
If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.\n\nIteration options:\n\nreusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only current iterated row is \"valid\")\n\n\n\n\n\n","category":"type"},{"location":"reading.html#CSV.Chunks","page":"Reading","title":"CSV.Chunks","text":"CSV.Chunks(source; ntasks::Integer=Threads.nthreads(), kwargs...) => CSV.Chunks\n\nReturns a file \"chunk\" iterator. Accepts all the same inputs and keyword arguments as CSV.File, see those docs for explanations of each keyword argument.\n\nThe ntasks keyword argument specifies how many chunks a file should be split up into, defaulting to the # of threads available to Julia (i.e. JULIA_NUM_THREADS environment variable) or 8 if Julia is run single-threaded.\n\nEach iteration of CSV.Chunks produces the next chunk of a file as a CSV.File. While initial file metadata detection is done only once (to determine # of columns, column names, etc), each iteration does independent type inference on columns. This is significant as different chunks may end up with different column types than previous chunks as new values are encountered in the file. Note that, as with CSV.File, types may be passed manually via the type or types keyword arguments.\n\nThis functionality is new and thus considered experimental; please open an issue if you run into any problems/bugs.\n\nArguments\n\nFile layout options:\n\nheader=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.\nnormalizenames::Bool=false: whether column names should be \"normalized\" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)\nskipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.\nfooterskip::Integer: number of rows at the end of a file to skip parsing. 
Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser\ntranspose::Bool: read a csv file \"transposed\", i.e. each column is parsed as a row\ncomment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.\nignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)\nselect: an AbstractVector of Integer, Symbol, String, or Bool, or a \"selector\" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.\ndrop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a \"drop\" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will ignored in the resulting CSV.File. Invalid values in drop are ignored.\nlimit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use threaded=false to ensure an exact limit if necessary\nbuffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.\nntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)\nrows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, lines_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows\nsource: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values of the column will be the input \"name\" (usually file name) of the input from whence the value was parsed. 
As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that inputs values in the auto-added column.\n\nParsing options:\n\nmissingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring=\"\", which means only an empty field (two consecutive delimiters) is considered missing\ndelim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file\nignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells\nquoted::Bool=true: whether parsing should check for quotechar at the start/end of cells\nquotechar='\"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters\nescapechar='\"': the Char used to escape quote characters in a quoted field\ndateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.\ndecimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','\ngroupmark=nothing: optionally specify a single-byte character denoting the number grouping mark, this allows parsing of numbers that have, e.g., thousand separators (1,000.00).\ntruestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default \"true\", \"True\", \"TRUE\", \"T\", \"1\" are used to detect true and \"false\", \"False\", \"FALSE\", \"F\", \"0\" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument\nstripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names\n\nColumn Type Options:\n\ntypes: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64 and, Dict(\"column1\"=>Float64) will set the column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.\ntypemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. 
Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only \"standard\" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is \"detected\", it will be mapped to the specified type.\npool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.\ndowncast::Bool=false: controls whether columns detected as Int64 will be \"downcast\" to the smallest possible integer type like Int8, Int16, Int32, etc.\nstringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special \"lazy\" AbstractVector that acts as a \"view\" into the original file data. This can lead to the most efficient parsing times, but note that the \"view\" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail\nstrict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing\nsilencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced\nmaxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings\ndebug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed\nvalidate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. 
If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.\n\nIteration options:\n\nreusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only current iterated row is \"valid\")\n\n\n\n\n\n","category":"type"},{"location":"reading.html#CSV.Rows","page":"Reading","title":"CSV.Rows","text":"CSV.Rows(source; kwargs...) => CSV.Rows\n\nRead a csv input returning a CSV.Rows object.\n\nThe input argument can be one of:\n\nfilename given as a string or FilePaths.jl type\na Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer\na CodeUnits object, which wraps a String, like codeunits(str)\na csv-formatted string can also be passed like IOBuffer(str)\na Cmd or other IO\na gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing\n\nTo read a csv file from a url, use the HTTP.jl package, where the HTTP.Response body can be passed like:\n\nf = CSV.Rows(HTTP.get(url).body)\n\nFor other IO or Cmd inputs, you can pass them like: f = CSV.Rows(read(obj)).\n\nWhile similar to CSV.File, CSV.Rows provides a slightly different interface, the tradeoffs including:\n\nVery minimal memory footprint; while iterating, only the current row values are buffered\nOnly provides row access via iteration; to access columns, one can stream the rows into a table type\nPerforms no type inference; each column/cell is essentially treated as Union{String, Missing}, users can utilize the performant Parsers.parse(T, str) to convert values to a more specific type if needed, or pass types upon construction using the type or types keyword arguments\n\nOpens the file and uses passed arguments to detect the number of columns, ***but not*** column types (column types default to String unless otherwise manually provided). The returned CSV.Rows object supports the Tables.jl interface and can iterate rows. Each row object supports propertynames, getproperty, and getindex to access individual row values. Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:\n\nfor row in CSV.Rows(file)\n println(\"a=$(row.a), b=$(row.b), c=$(row.c)\")\nend\n\nArguments\n\nFile layout options:\n\nheader=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). 
Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.\nnormalizenames::Bool=false: whether column names should be \"normalized\" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)\nskipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.\nfooterskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser\ntranspose::Bool: read a csv file \"transposed\", i.e. each column is parsed as a row\ncomment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.\nignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)\nselect: an AbstractVector of Integer, Symbol, String, or Bool, or a \"selector\" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.\ndrop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a \"drop\" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will ignored in the resulting CSV.File. Invalid values in drop are ignored.\nlimit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use threaded=false to ensure an exact limit if necessary\nbuffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.\nntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. 
JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)\nrows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, lines_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows\nsource: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values of the column will be the input \"name\" (usually file name) of the input from whence the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that inputs values in the auto-added column.\n\nParsing options:\n\nmissingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring=\"\", which means only an empty field (two consecutive delimiters) is considered missing\ndelim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file\nignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells\nquoted::Bool=true: whether parsing should check for quotechar at the start/end of cells\nquotechar='\"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters\nescapechar='\"': the Char used to escape quote characters in a quoted field\ndateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.\ndecimal='.': a Char indicating how decimals are separated in floats, i.e. 
3.14 uses '.', or 3,14 uses a comma ','\ngroupmark=nothing: optionally specify a single-byte character denoting the number grouping mark, this allows parsing of numbers that have, e.g., thousand separators (1,000.00).\ntruestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default \"true\", \"True\", \"TRUE\", \"T\", \"1\" are used to detect true and \"false\", \"False\", \"FALSE\", \"F\", \"0\" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument\nstripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names\n\nColumn Type Options:\n\ntypes: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64 and, Dict(\"column1\"=>Float64) will set the column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.\ntypemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only \"standard\" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is \"detected\", it will be mapped to the specified type.\npool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. 
If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.\ndowncast::Bool=false: controls whether columns detected as Int64 will be \"downcast\" to the smallest possible integer type like Int8, Int16, Int32, etc.\nstringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special \"lazy\" AbstractVector that acts as a \"view\" into the original file data. This can lead to the most efficient parsing times, but note that the \"view\" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail\nstrict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing\nsilencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced\nmaxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings\ndebug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed\nvalidate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.\n\nIteration options:\n\nreusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only current iterated row is \"valid\")\n\n\n\n\n\n","category":"type"},{"location":"reading.html#Utilities","page":"Reading","title":"Utilities","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"CSV.detect","category":"page"},{"location":"reading.html#CSV.detect","page":"Reading","title":"CSV.detect","text":"CSV.detect(str::String)\n\nUse the same logic used by CSV.File to detect column types, to parse a value from a plain string. This can be useful in conjunction with the CSV.Rows type, which returns each cell of a file as a String. The order of types attempted is: Int, Float64, Date, DateTime, Bool, and if all fail, the input String is returned. No errors are thrown. 
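As a quick sketch of that detection order (the commented return values are what the description above implies, shown for illustration):

using CSV

CSV.detect("101")         # 101 (Int)
CSV.detect("101.5")       # 101.5 (Float64)
CSV.detect("2024-02-28")  # a Date
CSV.detect("true")        # true (Bool)
CSV.detect("hey there")   # "hey there" -- all parses failed, so the input String is returned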
For advanced usage, you can pass your own Parsers.Options type as a keyword argument option=ops for sentinel value detection.\n\n\n\n\n\n","category":"function"},{"location":"reading.html#Common-terms","page":"Reading","title":"Common terms","text":"","category":"section"},{"location":"reading.html#Standard-types","page":"Reading","title":"Standard types","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"The types that are detected by default when column types are not provided by the user otherwise. They include: Int64, Float64, Date, DateTime, Time, Bool, and String.","category":"page"},{"location":"reading.html#newlines","page":"Reading","title":"Newlines","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"For all parsing functionality, newlines are detected/parsed automatically, regardless if they're present in the data as a single newline character ('\\n'), single return character ('\\r'), or full CRLF sequence (\"\\r\\n\").","category":"page"},{"location":"reading.html#Cardinality","page":"Reading","title":"Cardinality","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Refers to the ratio of unique values to total number of values in a column. Columns with \"low cardinality\" have a low % of unique values, or put another way, there are only a few unique values for the entire column of data where unique values are repeated many times. Columns with \"high cardinality\" have a high % of unique values relative to total number of values. Think of these as \"id-like\" columns where each or almost each value is a unique identifier with no (or few) repeated values.","category":"page"},{"location":"writing.html#Writing","page":"Writing","title":"Writing","text":"","category":"section"},{"location":"writing.html","page":"Writing","title":"Writing","text":"CSV.write\nCSV.RowWriter","category":"page"},{"location":"writing.html#CSV.write","page":"Writing","title":"CSV.write","text":"CSV.write(file, table; kwargs...) => file\ntable |> CSV.write(file; kwargs...) => file\n\nWrite a Tables.jl interface input to a csv file, given as an IO argument or String/FilePaths.jl type representing the file name to write to. 
Alternatively, CSV.RowWriter creates a row iterator, producing a csv-formatted string for each row in an input table.\n\nSupported keyword arguments include:\n\nbufsize::Int=2^22: The length of the buffer to use when writing each csv-formatted row; default 4MB; if a row is larger than the bufsize an error is thrown\ndelim::Union{Char, String}=',': a character or string to print out as the file's delimiter\nquotechar::Char='\"': ascii character to use for quoting text fields that may contain delimiters or newlines\nopenquotechar::Char: instead of quotechar, use openquotechar and closequotechar to support different starting and ending quote characters\nescapechar::Char='\"': ascii character used to escape quote characters in a text field\nmissingstring::String=\"\": string to print for missing values\ndateformat=Dates.default_format(T): the date format string to use for printing out Date & DateTime columns\nappend=false: whether to append writing to an existing file/IO, if true, it will not write column names by default\ncompress=false: compress the written output using standard gzip compression (provided by the CodecZlib.jl package); note that a compression stream can always be provided as the first \"file\" argument to support other forms of compression; passing compress=true is just for convenience to avoid needing to manually setup a GzipCompressorStream\nwriteheader=!append: whether to write an initial row of delimited column names, not written by default if appending\nheader: pass a list of column names (Symbols or Strings) to use instead of the column names of the input table\nnewline='\\n': character or string to use to separate rows (lines in the csv file)\nquotestrings=false: whether to force all strings to be quoted or not\ndecimal='.': character to use as the decimal point when writing floating point numbers\ntransform=(col,val)->val: a function that is applied to every cell e.g. 
we can transform all nothing values to missing using (col, val) -> something(val, missing)\nbom=false: whether to write a UTF-8 BOM header (0xEF 0xBB 0xBF) or not\npartition::Bool=false: by passing true, the table argument is expected to implement Tables.partitions and the file argument can either be an indexable collection of IO, file Strings, or a single file String that will have an index appended to the name\n\nExamples\n\nusing CSV, Tables, DataFrames\n\n# write out a DataFrame to csv file\ndf = DataFrame(rand(10, 10), :auto)\nCSV.write(\"data.csv\", df)\n\n# write a matrix to an in-memory IOBuffer\nio = IOBuffer()\nmat = rand(10, 10)\nCSV.write(io, Tables.table(mat))\n\n\n\n\n\n","category":"function"},{"location":"writing.html#CSV.RowWriter","page":"Writing","title":"CSV.RowWriter","text":"CSV.RowWriter(table; kwargs...)\n\nCreates an iterator that produces csv-formatted strings for each row in the input table.\n\nSupported keyword arguments include:\n\nbufsize::Int=2^22: The length of the buffer to use when writing each csv-formatted row; default 4MB; if a row is larger than the bufsize an error is thrown\ndelim::Union{Char, String}=',': a character or string to print out as the file's delimiter\nquotechar::Char='\"': ascii character to use for quoting text fields that may contain delimiters or newlines\nopenquotechar::Char: instead of quotechar, use openquotechar and closequotechar to support different starting and ending quote characters\nescapechar::Char='\"': ascii character used to escape quote characters in a text field\nmissingstring::String=\"\": string to print for missing values\ndateformat=Dates.default_format(T): the date format string to use for printing out Date & DateTime columns\nheader: pass a list of column names (Symbols or Strings) to use instead of the column names of the input table\nnewline='\\n': character or string to use to separate rows (lines in the csv file)\nquotestrings=false: whether to force all strings to be quoted or not\ndecimal='.': character to use as the decimal point when writing floating point numbers\ntransform=(col,val)->val: a function that is applied to every cell e.g. we can transform all nothing values to missing using (col, val) -> something(val, missing)\nbom=false: whether to write a UTF-8 BOM header (0xEF 0xBB 0xBF) or not\n\n\n\n\n\n","category":"type"}]
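The CSV.RowWriter docstring above lists its keyword arguments but shows no usage; below is a minimal sketch of how iterating it might look. The matrix, the "streamed.csv" file name, and the keyword values are illustrative assumptions, not taken from the docstring.

using CSV, Tables

# wrap a small matrix as a Tables.jl table; any Tables.jl source would work
mat = [1 2 3; 4 5 6]
tbl = Tables.table(mat)

# each iteration yields one csv-formatted String; the first string produced
# is the header row, followed by one string per data row
for row in CSV.RowWriter(tbl; delim=',', newline='\n')
    print(row)
end

# the same iterator can feed any IO incrementally, one row at a time
open("streamed.csv", "w") do io
    foreach(r -> write(io, r), CSV.RowWriter(tbl))
end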
+[{"location":"examples.html#Examples","page":"Examples","title":"Examples","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"Pages = [\"examples.md\"]","category":"page"},{"location":"examples.html#stringencodings","page":"Examples","title":"Non-UTF-8 character encodings","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"# assume I have csv text data encoded in ISO-8859-1 encoding\n# I load the StringEncodings package, which provides encoding conversion functionality\nusing CSV, StringEncodings\n\n# I open my `iso8859_encoded_file.csv` with the `enc\"ISO-8859-1\"` encoding\n# and pass the opened IO object to `CSV.File`, which will read the entire\n# input into a temporary file, then parse the data from the temp file\nfile = CSV.File(open(\"iso8859_encoded_file.csv\", enc\"ISO-8859-1\"))\n\n# to instead have the encoding conversion happen in memory, pass\n# `buffer_in_memory=true`; this can be faster, but obviously results\n# in more memory being used rather than disk via a temp file\nfile = CSV.File(open(\"iso8859_encoded_file.csv\", enc\"ISO-8859-1\"); buffer_in_memory=true)","category":"page"},{"location":"examples.html#vectorinputs","page":"Examples","title":"Concatenate multiple inputs at once","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this case, I have a vector of delimited data inputs that each have\n# matching schema (the same column names and types). I'd like to process all\n# of the inputs together and vertically concatenate them into one \"long\" table.\ndata = [\n \"a,b,c\\n1,2,3\\n4,5,6\\n\",\n \"a,b,c\\n7,8,9\\n10,11,12\\n\",\n \"a,b,c\\n13,14,15\\n16,17,18\",\n]\n\n# I can just pass a `Vector` of inputs, in this case `IOBuffer(::String)`, but it\n# could also be a `Vector` of any valid input source, like `AbstractVector{UInt8}`,\n# filenames, `IO`, etc. Each input will be processed on a separate thread, with the results\n# being vertically concatenated afterwards as a single `CSV.File`. Each thread's columns\n# will be lazily concatenated using the `ChainedVector` type. 
As always, if we want to\n# send the parsed columns directly to a sink function, we can use `CSV.read`, like\n# `df = CSV.read(map(IOBuffer, data), DataFrame)`.\nf = CSV.File(map(IOBuffer, data))","category":"page"},{"location":"examples.html#gzipped_input","page":"Examples","title":"Gzipped input","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"# assume I have csv text data compressed via gzip\n# no additional packages are needed; CSV.jl can decompress automatically\nusing CSV\n\n# pass name of gzipped input file directly; data will be decompressed to a\n# temporary file, then mmapped as a byte buffer for actual parsing\nfile = CSV.File(\"data.gz\")\n\n# to instead have the decompression happen in memory, pass\n# `buffer_in_memory=true`; this can be faster, but obviously results\n# in more memory being used rather than disk via a temp file\nfile = CSV.File(\"data.gz\"; buffer_in_memory=true)","category":"page"},{"location":"examples.html#csv_string","page":"Examples","title":"Delimited data in a string","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# I have csv data in a string I want to parse\ndata = \"\"\"\na,b,c\n1,2,3\n4,5,6\n\"\"\"\n\n# Calling `IOBuffer` on a string returns an in-memory IO object\n# of the string data, which can be passed to `CSV.File` for parsing\nfile = CSV.File(IOBuffer(data))","category":"page"},{"location":"examples.html#http","page":"Examples","title":"Data from the web/a url","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"# assume there's delimited data I want to read from the web\n# one option is to use the HTTP.jl package\nusing CSV, HTTP\n\n# I first make the web request to get the data via `HTTP.get` on the `url`\nhttp_response = HTTP.get(url)\n\n# I can then access the data of the response as a `Vector{UInt8}` and pass\n# it directly to `CSV.File` for parsing\nfile = CSV.File(http_response.body)\n\n# another option, with Julia 1.6+, is using the Downloads stdlib\nusing Downloads\nhttp_response = Downloads.download(url)\n\n# by default, `Downloads.download` writes the response data to a temporary file\n# which can then be passed to `CSV.File` for parsing\nfile = CSV.File(http_response)","category":"page"},{"location":"examples.html#zip_example","page":"Examples","title":"Reading from a zip file","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using ZipFile, CSV, DataFrames\n\na = DataFrame(a = 1:3)\nCSV.write(\"a.csv\", a)\n\n# zip the file; Windows users who do not have zip available on the PATH can manually zip the CSV\n# or write directly into the zip archive as shown below\n;zip a.zip a.csv\n\n# alternatively, write directly into the zip archive (without creating an unzipped csv file first)\nz = ZipFile.Writer(\"a2.zip\")\nf = ZipFile.addfile(z, \"a.csv\", method=ZipFile.Deflate)\na |> CSV.write(f)\nclose(z)\n\n# read file from zip archive\nz = ZipFile.Reader(\"a.zip\") # or \"a2.zip\"\n\n# identify the right file in zip\na_file_in_zip = filter(x->x.name == \"a.csv\", z.files)[1]\n\na_copy = CSV.File(a_file_in_zip) |> DataFrame\n\na == a_copy\n\nclose(z)","category":"page"},{"location":"examples.html#second_row_header","page":"Examples","title":"Column names on 2nd row","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\ndata = \"\"\"\ndescriptive 
row with information about the file that we'd like to ignore\na,b,c\n1,2,3\n4,5,6\n\"\"\"\n\n# by passing header=2, parsing will ignore the 1st row entirely\n# then parse the column names on row 2, then by default, it assumes\n# the data starts on the row after the column names (row 3 in this case)\n# which is correct for this case\nfile = CSV.File(IOBuffer(data); header=2)","category":"page"},{"location":"examples.html#no_header","page":"Examples","title":"No column names in data","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this case, our data doesn't have any column names\ndata = \"\"\"\n1,2,3\n4,5,6\n\"\"\"\n\n# by passing `header=false`, parsing won't worry about looking for column names\n# anywhere, but instead just start parsing the data and generate column names\n# as needed, like `Column1`, `Column2`, and `Column3` in this case\nfile = CSV.File(IOBuffer(data); header=false)","category":"page"},{"location":"examples.html#manual_header","page":"Examples","title":"Manually provide column names","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this case, our data doesn't have any column names\ndata = \"\"\"\n1,2,3\n4,5,6\n\"\"\"\n\n# instead of passing `header=false` and getting auto-generated column names,\n# we can instead pass the column names ourselves\nfile = CSV.File(IOBuffer(data); header=[\"a\", \"b\", \"c\"])\n\n# we can also pass the column names as Symbols; a copy of the manually provided\n# column names will always be made and then converted to `Vector{Symbol}`\nfile = CSV.File(IOBuffer(data); header=[:a, :b, :c])","category":"page"},{"location":"examples.html#multi_row_header","page":"Examples","title":"Multi-row column names","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this case, our column names are `col_a`, `col_b`, and `col_c`,\n# but split over the first and second rows\ndata = \"\"\"\ncol,col,col\na,b,c\n1,2,3\n4,5,6\n\"\"\"\n\n# by passing a collection of integers, parsing will parse each row in the collection\n# and concatenate the values for each column, separating rows with `_` character\nfile = CSV.File(IOBuffer(data); header=[1, 2])","category":"page"},{"location":"examples.html#normalize_header","page":"Examples","title":"Normalizing column names","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this case, our data are single letters, with column names of \"1\", \"2\", and \"3\"\n# A single digit isn't a valid identifier in Julia, meaning we couldn't do something\n# like `1 = 2 + 2`, where `1` would be a variable name\ndata = \"\"\"\n1,2,3\na,b,c\nd,e,f\nh,i,j\n\"\"\"\n\n# in order to have valid identifiers for column names, we can pass\n# `normalizenames=true`, which result in our column names becoming \"_1\", \"_2\", and \"_3\"\n# note this isn't required, but can be convenient in certain cases\nfile = CSV.File(IOBuffer(data); normalizenames=true)\n\n# we can access the first column like\nfile._1\n\n# another example where we may want to normalize is column names with spaces in them\ndata = \"\"\"\ncolumn one,column two, column three\n1,2,3\n4,5,6\n\"\"\"\n\n# normalizing will result in column names like \"column_one\", \"column_two\" and \"column_three\"\nfile = CSV.File(IOBuffer(data); 
normalizenames=true)","category":"page"},{"location":"examples.html#skipto_example","page":"Examples","title":"Skip to specific row where data starts","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# our data has a first row that we'd like to ignore; our data also doesn't have\n# column names, so we'd like them to be auto-generated\ndata = \"\"\"\ndescriptive row that gives information about the data that we'd like to ignore\n1,2,3\n4,5,6\n\"\"\"\n\n# with no column names in the data, we first pass `header=false`; by itself,\n# this would result in parsing starting on row 1 to parse the actual data;\n# but we'd like to ignore the first row, so we pass `skipto=2` to skip over\n# the first row; our colum names will be generated like `Column1`, `Column2`, `Column3`\nfile = CSV.File(IOBuffer(data); header=false, skipto=2)","category":"page"},{"location":"examples.html#footerskip_example","page":"Examples","title":"Skipping trailing useless rows","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# our data has column names of \"a\", \"b\", and \"c\"\n# but at the end of the data, we have 2 rows we'd like to ignore while parsing\n# since they're not properly delimited\ndata = \"\"\"\na,b,c\n1,2,3\n4,5,6\n7,8,9\ntotals: 12, 15, 18\ngrand total: 45\n\"\"\"\n\n# by passing `footerskip=2`, we tell parsing to start the end of the data and\n# read 2 rows, ignoring their contents, then mark the ending position where\n# the normal parsing process should finish\nfile = CSV.File(IOBuffer(data); footerskip=2)","category":"page"},{"location":"examples.html#transpose_example","page":"Examples","title":"Reading transposed data","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# our data is transposed, meaning our column names are in the first column,\n# with the data for column \"a\" all on the first row, data for column \"b\"\n# all on the second row, and so on.\ndata = \"\"\"\na,1,4,7\nb,2,5,8\nc,3,6,9\n\"\"\"\n\n# by passing `transpose=true`, parsing will look for column names in the first\n# column of data, then parse each row as a separate column\nfile = CSV.File(IOBuffer(data); transpose=true)","category":"page"},{"location":"examples.html#comment_example","page":"Examples","title":"Ignoring commented rows","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# here, we have several non-data rows that all begin with the \"#\" string\ndata = \"\"\"\n# row describing column names\na,b,c\n# row describing first row of data\n1,2,3\n# row describing second row of data\n4,5,6\n\"\"\"\n\n# we want to ignore these \"commented\" rows\nfile = CSV.File(IOBuffer(data); comment=\"#\")","category":"page"},{"location":"examples.html#ignoreemptyrows_example","page":"Examples","title":"Ignoring empty rows","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# here, we have a \"gap\" row in between the first and second row of data\n# by default, these \"empty\" rows are ignored, but in our case, this is\n# how a row of data is input when all columns have missing/null values\n# so we don't want those rows to be ignored so we can know how many\n# missing cases there are in our data\ndata = \"\"\"\na,b,c\n1,2,3\n\n4,5,6\n\"\"\"\n\n# by passing `ignoreemptyrows=false`, we ensure parsing 
treats an empty row\n# as each column having a `missing` value set for that row\nfile = CSV.File(IOBuffer(data); ignoreemptyrows=false)","category":"page"},{"location":"examples.html#select_example","page":"Examples","title":"Including/excluding columns","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# simple dataset, but we know column \"b\" isn't needed\n# so we'd like to save time by having parsing ignore it completely\ndata = \"\"\"\na,b,c\n1,2,3\n4,5,6\n7,8,9\n\"\"\"\n\n# there are quite a few ways to provide the select/drop arguments\n# so we provide an example of each, first for selecting the columns\n# \"a\" and \"c\" that we want to include or keep from parsing\nfile = CSV.File(IOBuffer(data); select=[1, 3])\nfile = CSV.File(IOBuffer(data); select=[:a, :c])\nfile = CSV.File(IOBuffer(data); select=[\"a\", \"c\"])\nfile = CSV.File(IOBuffer(data); select=[true, false, true])\nfile = CSV.File(IOBuffer(data); select=(i, nm) -> i in (1, 3))\n# now examples of dropping, when we'd rather specify the column(s)\n# we'd like to drop/exclude from parsing\nfile = CSV.File(IOBuffer(data); drop=[2])\nfile = CSV.File(IOBuffer(data); drop=[:b])\nfile = CSV.File(IOBuffer(data); drop=[\"b\"])\nfile = CSV.File(IOBuffer(data); drop=[false, true, false])\nfile = CSV.File(IOBuffer(data); drop=(i, nm) -> i == 2)","category":"page"},{"location":"examples.html#limit_example","page":"Examples","title":"Limiting number of rows from data","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# here, we have quite a few rows of data (relative to other examples, lol)\n# but we know we only need the first 3 for the analysis we need to do\n# so instead of spending the time parsing the entire file, we'd like\n# to just read the first 3 rows and ignore the rest\ndata = \"\"\"\na,b,c\n1,2,3\n4,5,6\n7,8,9\n10,11,12\n13,14,15\n\"\"\"\n\n# parsing will start reading rows, and once 3 have been read, it will\n# terminate early, avoiding the parsing of the rest of the data entirely\nfile = CSV.File(IOBuffer(data); limit=3)","category":"page"},{"location":"examples.html#missing_string_example","page":"Examples","title":"Specifying custom missing strings","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this data, our first column has \"missing\" values coded with -999\n# but our score column has \"NA\" instead\n# we'd like either of those values to show up as `missing` after we parse the data\ndata = \"\"\"\ncode,age,score\n0,21,3.42\n1,42,6.55\n-999,81,NA\n-999,83,NA\n\"\"\"\n\n# by passing missingstring=[\"-999\", \"NA\"], parsing will check each cell if it matches\n# either string in order to set the value of the cell to `missing`\nfile = CSV.File(IOBuffer(data); missingstring=[\"-999\", \"NA\"])","category":"page"},{"location":"examples.html#string_delim","page":"Examples","title":"String delimiter","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# our data has two columns, separated by double colon\n# characters (\"::\")\ndata = \"\"\"\ncol1::col2\n1::2\n3::4\n\"\"\"\n\n# we can pass a single character or string for delim\nfile = CSV.File(IOBuffer(data); delim=\"::\")","category":"page"},{"location":"examples.html#ignorerepeated_example","page":"Examples","title":"Fixed width 
files","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# This is an example of \"fixed width\" data, where each\n# column is the same number of characters away from each\n# other on each row. Fields are \"padded\" with extra\n# delimiters (in this case `' '`) so that each column is\n# the same number of characters each time\ndata = \"\"\"\ncol1 col2 col3\n123431 2 3421\n2355 346 7543\n\"\"\"\n# In addition to our `delim`, we can pass\n# `ignorerepeated=true`, which tells parsing that\n#consecutive delimiters should be treated as a single\n# delimiter.\nfile = CSV.File(IOBuffer(data); delim=' ', ignorerepeated=true)","category":"page"},{"location":"examples.html#quoted_example","page":"Examples","title":"Turning off quoted cell parsing","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# by default, cells like the 1st column, 2nd row\n# will be treated as \"quoted\" cells, where they start\n# and end with the quote character '\"'. The quotes will\n# be removed from the final parsed value\n# we may, however, want the \"raw\" value and _not_ ignore\n# the quote characters in the final value\ndata = \"\"\"\na,b,c\n\"hey\",2,3\nthere,4,5\nsailor,6,7\n\"\"\"\n\n# we can \"turn off\" the detection of quoted cells\n# by passing `quoted=false`\nfile = CSV.File(IOBuffer(data); quoted=false)","category":"page"},{"location":"examples.html#quotechar_example","page":"Examples","title":"Quoted & escaped fields","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In this data, we have a few \"quoted\" fields, which means the field's value starts and ends with `quotechar` (or\n# `openquotechar` and `closequotechar`, respectively). Quoted fields allow the field to contain characters that would otherwise\n# be significant to parsing, such as delimiters or newline characters. When quoted, parsing will ignore these otherwise\n# significant characters until the closing quote character is found. For quoted fields that need to also include the quote\n# character itself, an escape character is provided to tell parsing to ignore the next character when looking for a close quote\n# character. In the syntax examples, the keyword arguments are passed explicitly, but these also happen to be the default\n# values, so just doing `CSV.File(IOBuffer(data))` would result in successful parsing.\ndata = \"\"\"\ncol1,col2\n\"quoted field with a delimiter , inside\",\"quoted field that contains a \\\\n newline and \"\"inner quotes\\\"\\\"\\\"\nunquoted field,unquoted field with \"inner quotes\"\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); quotechar='\"', escapechar='\"')\n\nfile = CSV.File(IOBuffer(data); openquotechar='\"' closequotechar='\"', escapechar='\"')","category":"page"},{"location":"examples.html#dateformat_example","page":"Examples","title":"DateFormat","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In this file, our `date` column has dates that are formatted like `yyyy/mm/dd`. We can pass just such a string to the\n# `dateformat` keyword argument to tell parsing to use it when looking for `Date` or `DateTime` columns. 
Note that currently,\n# only a single `dateformat` string can be passed to parsing, meaning multiple columns with different date formats cannot all\n# be parsed as `Date`/`DateTime`.\ndata = \"\"\"\ncode,date\n0,2019/01/01\n1,2019/01/02\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); dateformat=\"yyyy/mm/dd\")","category":"page"},{"location":"examples.html#decimal_example","page":"Examples","title":"Custom decimal separator","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In many places in the world, floating point number decimals are separated with a comma instead of a period (`3,14` vs. `3.14`)\n# . We can correctly parse these numbers by passing in the `decimal=','` keyword argument. Note that we probably need to\n# explicitly pass `delim=';'` in this case, since the parser will probably think that it detected `','` as the delimiter.\ndata = \"\"\"\ncol1;col2;col3\n1,01;2,02;3,03\n4,04;5,05;6,06\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); delim=';', decimal=',')","category":"page"},{"location":"examples.html#thousands_example","page":"Examples","title":"Thousands separator","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In many places in the world, digits to the left of the decimal place are broken into\n# groups by a thousands separator. We can ignore those separators by passing the `groupmark`\n# keyword argument.\ndata = \"\"\"\nx y\n1 2\n2 1,729\n3 87,539,319\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); groupmark=',')","category":"page"},{"location":"examples.html#groupmark_example","page":"Examples","title":"Custom groupmarks","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In some contexts, separators other than thousands separators group digits in a number.\n# `groupmark` supports ignoring them as long as the separator character is ASCII\ndata = \"\"\"\nname;ssn;credit card number\nAyodele Beren;597-21-8366;5538-6111-0574-2633\nTrinidad Shiori;387-35-5126;3017-9300-0776-5301\nOri Cherokee;731-12-4606;4682-5416-0636-3877\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); groupmark='-')","category":"page"},{"location":"examples.html#truestrings_example","page":"Examples","title":"Custom bool strings","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# By default, parsing only considers the string values `true` and `false` as valid `Bool` values. To consider alternative\n# values, we can pass a `Vector{String}` to the `truestrings` and `falsestrings` keyword arguments.\ndata = \"\"\"\nid,paid,attended\n0,T,TRUE\n1,F,TRUE\n2,T,FALSE\n3,F,FALSE\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); truestrings=[\"T\", \"TRUE\"], falsestrings=[\"F\", \"FALSE\"])","category":"page"},{"location":"examples.html#matrix_example","page":"Examples","title":"Matrix-like Data","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# This file contains a 3x3 identity matrix of `Float64`. 
By default, parsing will detect the delimiter and type, but we can\n# also explicitly pass `delim= ' '` and `types=Float64`, which tells parsing to explicitly treat each column as `Float64`,\n# without having to guess the type on its own.\ndata = \"\"\"\n1.0 0.0 0.0\n0.0 1.0 0.0\n0.0 0.0 1.0\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); header=false)\nfile = CSV.File(IOBuffer(data); header=false, delim=' ', types=Float64)\n\n# as a last step if you want to convert this to a Matrix, this can be done by reading in first as a DataFrame and then\n# function chaining to a Matrix\nusing DataFrames\nA = file|>DataFrame|>Matrix\n\n# another alternative is to simply use CSV.Tables.matrix and say\nB = file|>CSV.Tables.matrix # does not require DataFrames","category":"page"},{"location":"examples.html#types_example","page":"Examples","title":"Providing types","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In this file, our 3rd column has an invalid value on the 2nd row `invalid`. Let's imagine we'd still like to treat it as an\n# `Int` column, and ignore the `invalid` value. The syntax examples provide several ways we can tell parsing to treat the 3rd\n# column as `Int`, by referring to column index `3`, or column name with `Symbol` or `String`. We can also provide an entire\n# `Vector` of types for each column (and which needs to match the length of columns in the file). There are two additional\n# keyword arguments that control parsing behavior; in the first 4 syntax examples, we would see a warning printed like\n# `\"warning: invalid Int64 value on row 2, column 3\"`. In the fifth example, passing `silencewarnings=true` will suppress this\n# warning printing. In the last syntax example, passing `strict=true` will result in an error being thrown during parsing.\ndata = \"\"\"\ncol1,col2,col3\n1,2,3\n4,5,invalid\n6,7,8\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); types=Dict(3 => Int))\nfile = CSV.File(IOBuffer(data); types=Dict(:col3 => Int))\nfile = CSV.File(IOBuffer(data); types=Dict(\"col3\" => Int))\nfile = CSV.File(IOBuffer(data); types=[Int, Int, Int])\nfile = CSV.File(IOBuffer(data); types=[Int, Int, Int], silencewarnings=true)\nfile = CSV.File(IOBuffer(data); types=[Int, Int, Int], strict=true)\n\n\n# In this file we have lots of columns, and would like to specify the same type for all\n# columns except one which should have a different type. We can do this by providing a\n# function that takes the column index and column name and uses these to decide the type.\ndata = \"\"\"\ncol1,col2,col3,col4,col5,col6,col7\n1,2,3,4,5,6,7\n0,2,3,4,5,6,7\n1,2,3,4,5,6,7\n\"\"\"\nfile = CSV.File(IOBuffer(data); types=(i, name) -> i == 1 ? Bool : Int8)\nfile = CSV.File(IOBuffer(data); types=(i, name) -> name == :col1 ? Bool : Int8)\n# Alternatively by providing the exact name for the first column and a Regex to match the rest.\n# Note that an exact column name always takes precedence over a regular expression.\nfile = CSV.File(IOBuffer(data); types=Dict(:col1 => Bool, r\"^col\\d\" => Int8))","category":"page"},{"location":"examples.html#typemap_example","page":"Examples","title":"Typemap","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In this file, we have U.S. zipcodes in the first column that we'd rather not treat as `Int`, but parsing will detect it as\n# such. 
In the first syntax example, we pass `typemap=IdDict(Int => String)`, which tells parsing to treat any detected `Int`\n# columns as `String` instead. In the second syntax example, we alternatively set the `zipcode` column type manually.\ndata = \"\"\"\nzipcode,score\n03494,9.9\n12345,6.7\n84044,3.4\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); typemap=IdDict(Int => String))\nfile = CSV.File(IOBuffer(data); types=Dict(:zipcode => String))","category":"page"},{"location":"examples.html#pool_example","page":"Examples","title":"Pooled values","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations\n# like joining and grouping when `String` values are \"pooled\", meaning each unique value is mapped to a `UInt32`. By default,\n# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide\n# greater control: `pool=0.4` means that if 40% or less of a column's values are unique, then it will be pooled.\ndata = \"\"\"\nid,code\nA18E9,AT\nBF392,GC\n93EBC,AT\n54EE1,AT\n8CD2E,GC\n\"\"\"\n\nfile = CSV.File(IOBuffer(data))\nfile = CSV.File(IOBuffer(data); pool=0.4)\nfile = CSV.File(IOBuffer(data); pool=0.6)","category":"page"},{"location":"examples.html#nonstring_pool_example","page":"Examples","title":"Non-string pooled values","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# in this data, our `category` column is an integer type, but represents a limited set of values that could benefit from\n# pooling. Indeed, we may want to do various DataFrame grouping/joining operations on the column, which can be more\n# efficient if the column type is a PooledVector. By default, passing `pool=true` will only pool string column types,\n# if we pass a vector or dict however, we can specify how specific, non-string type, columns should be pooled.\ndata = \"\"\"\ncategory,amount\n1,100.01\n1,101.10\n2,201.10\n2,202.40\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); pool=Dict(1 => true))\nfile = CSV.File(IOBuffer(data); pool=[true, false])","category":"page"},{"location":"examples.html#pool_absolute_threshold","page":"Examples","title":"Pool with absolute threshold","text":"","category":"section"},{"location":"examples.html","page":"Examples","title":"Examples","text":"using CSV\n\n# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations\n# like joining and grouping when `String` values are \"pooled\", meaning each unique value is mapped to a `UInt32`. By default,\n# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. 
Via the `pool` keyword argument, we can provide\n# greater control: `pool=(0.5, 2)` means that if a column has 2 or fewer unique values _and_ the total number of unique values is less than 50% of all values, then it will be pooled.\ndata = \"\"\"\nid,code\nA18E9,AT\nBF392,GC\n93EBC,AT\n54EE1,AT\n8CD2E,GC\n\"\"\"\n\nfile = CSV.File(IOBuffer(data); pool=(0.5, 2))","category":"page"},{"location":"index.html#CSV.jl-Documentation","page":"Home","title":"CSV.jl Documentation","text":"","category":"section"},{"location":"index.html","page":"Home","title":"Home","text":"GitHub Repo: https://github.com/JuliaData/CSV.jl","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"Welcome to CSV.jl! A pure-Julia package for handling delimited text data, be it comma-delimited (csv), tab-delimited (tsv), or otherwise.","category":"page"},{"location":"index.html#Installation","page":"Home","title":"Installation","text":"","category":"section"},{"location":"index.html","page":"Home","title":"Home","text":"You can install CSV by typing the following in the Julia REPL:","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"] add CSV ","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"followed by ","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"using CSV","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"to load the package.","category":"page"},{"location":"index.html#Overview","page":"Home","title":"Overview","text":"","category":"section"},{"location":"index.html","page":"Home","title":"Home","text":"To start out, let's discuss the high-level functionality provided by the package, which hopefully will help direct you to more specific documentation for your use-case:","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"CSV.File: the most commonly used function for ingesting delimited data; will read an entire data input or vector of data inputs, detecting number of columns and rows, along with the type of data for each column. Returns a CSV.File object, which is like a lightweight table/DataFrame. Assuming file is a variable of a CSV.File object, individual columns can be accessed like file.col1, file[:col1], or file[\"col\"]. You can see parsed column names via file.names. A CSV.File can also be iterated, where a CSV.Row is produced on each iteration, which allows access to each value in the row via row.col1, row[:col1], or row[1]. You can also index a CSV.File directly, like file[1] to return the entire CSV.Row at the provided index/row number. Multiple threads will be used while parsing the input data if the input is large enough, and full return column buffers to hold the parsed data will be allocated. CSV.File satisfies the Tables.jl \"source\" interface, and so can be passed to valid sink functions like DataFrame, SQLite.load!, Arrow.write, etc. Supports a number of keyword arguments to control parsing, column type, and other file metadata options.\nCSV.read: a convenience function identical to CSV.File, but used when a CSV.File will be passed directly to a sink function, like a DataFrame. 
In some cases, sinks may make copies of incoming data for their own safety; by calling CSV.read(file, DataFrame), no copies of the parsed CSV.File will be made, and the DataFrame will take direct ownership of the CSV.File's columns, which is more efficient than doing CSV.File(file) |> DataFrame which will result in an extra copy of each column being made. Keyword arguments are identical to CSV.File. Any valid Tables.jl sink function/table type can be passed as the 2nd argument. Like CSV.File, a vector of data inputs can be passed as the 1st argument, which will result in a single \"long\" table of all the inputs vertically concatenated. Each input must have identical schemas (column names and types).\nCSV.Rows: an alternative approach for consuming delimited data, where the input is only consumed one row at a time, which allows \"streaming\" the data with a lower memory footprint than CSV.File. Supports many of the same options as CSV.File, except column type handling is a little different. By default, every column type will be essentially Union{Missing, String}, i.e. no automatic type detection is done, but column types can be provided manually. Multithreading is not used while parsing. After constructing a CSV.Rows object, rows can be \"streamed\" by iterating, where each iteration produces a CSV.Row2 object, which operates similar to CSV.File's CSV.Row type where individual row values can be accessed via row.col1, row[:col1], or row[1]. If each row is processed individually, additional memory can be saved by passing reusebuffer=true, which means a single buffer will be allocated to hold the values of only the currently iterated row. CSV.Rows also supports the Tables.jl interface and can also be passed to valid sink functions.\nCSV.Chunks: similar to CSV.File, but allows passing a ntasks::Integer keyword argument which will cause the input file to be \"chunked\" up into ntasks number of chunks. After constructing a CSV.Chunks object, each iteration of the object will return a CSV.File of the next parsed chunk. Useful for processing extremely large files in \"chunks\". Because each iterated element is a valid Tables.jl \"source\", CSV.Chunks satisfies the Tables.partitions interface, so sinks that can process input partitions can operate by passing CSV.Chunks as the \"source\".\nCSV.write: A valid Tables.jl \"sink\" function for writing any valid input table out in a delimited text format. Supports many options for controlling the output like delimiter, quote characters, etc. Writes data to an internal buffer, which is flushed out when full, buffer size is configurable. Also supports writing out partitioned inputs as separate output files, one file per input partition. To write out a DataFrame, for example, it's simply CSV.write(\"data.csv\", df), or to write out a matrix, it's using Tables; CSV.write(\"data.csv\", Tables.table(mat))\nCSV.RowWriter: An alternative way to produce csv output; takes any valid Tables.jl input, and on each iteration, produces a single csv-formatted string from the input table's row.","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"That's quite a bit! Let's boil down a TL;DR:","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"Just want to read a delimited file or collection of files and do basic stuff with data? Use CSV.File(file) or CSV.read(file, DataFrame)\nDon't need the data as a whole or want to stream through a large file row-by-row? 
Use CSV.Rows.\nWant to process a large file in \"batches\"/chunks? Use CSV.Chunks.\nNeed to produce a csv? Use CSV.write.\nWant to iterate an input table and produce a single csv string per row? CSV.RowWriter.","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"For the rest of the manual, we're going to have two big sections, Reading and Writing where we'll walk through the various options to CSV.File/CSV.read/CSV.Rows/CSV.Chunks and CSV.write/CSV.RowWriter.","category":"page"},{"location":"index.html","page":"Home","title":"Home","text":"Pages = [\"reading.md\", \"writing.md\", \"examples.md\"]","category":"page"},{"location":"reading.html#Reading","page":"Reading","title":"Reading","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"The format for this section will go through the various inputs/options supported by CSV.File/CSV.read, with notes about compatibility with the other reading functionality (CSV.Rows, CSV.Chunks, etc.).","category":"page"},{"location":"reading.html#input","page":"Reading","title":"input","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A required argument for reading. Input data should be ASCII or UTF-8 encoded text; for other text encodings, use the StringEncodings.jl package to convert to UTF-8.","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Any delimited input is ultimately converted to a byte buffer (Vector{UInt8}) for parsing/processing, so with that in mind, let's look at the various supported input types:","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"File name as a String or FilePath; parsing will call Mmap.mmap(string(file)) to get a byte buffer to the file data. For gzip compressed inputs, like file.gz, the CodecZlib.jl package will be used to decompress the data to a temporary file first, then mmapped to a byte buffer. Decompression can also be done in memory by passing buffer_in_memory=true. Note that only gzip-compressed data is automatically decompressed; for other forms of compressed data, seek out the appropriate package to decompress and pass an IO or Vector{UInt8} of decompressed data as input.\nVector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}}: if you already have a byte buffer from wherever, you can just pass it in directly. If you have a csv-formatted string, you can pass it like CSV.File(IOBuffer(str))\nIO or Cmd: you can pass an IO or Cmd directly, which will be consumed into a temporary file, then mmapped as a byte vector; to avoid a temp file and instead buffer data in memory, pass buffer_in_memory=true.\nFor files from the web, you can call HTTP.get(url).body to request the file, then access the data as a Vector{UInt8} from the body field, which can be passed directly for parsing. 
For Julia 1.6+, you can also use the Downloads stdlib, like Downloads.download(url) which can be passed to parsing","category":"page"},{"location":"reading.html#Examples","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"StringEncodings.jl example\nVector of inputs example\nGzip input\nDelimited data in a string\nData from the web\nData in zip archive","category":"page"},{"location":"reading.html#header","page":"Reading","title":"header","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"The header keyword argument controls how column names are treated when processing files. By default, it is assumed that the column names are the first row/line of the input, i.e. header=1. Alternative valid arguments for header include:","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Integer, e.g. header=2: provide the row number as an Integer where the column names can be found\nBool, e.g. header=false: no column names exist in the data; column names will be auto-generated depending on the # of columns, like Column1, Column2, etc.\nVector{String} or Vector{Symbol}: manually provide column names as strings or symbols; should match the # of columns in the data. A copy of the Vector will be made and converted to Vector{Symbol}\nAbstractVector{<:Integer}: in rare cases, there may be multi-row headers; by passing a collection of row numbers, each row will be parsed and the values for each row will be concatenated to form the final column names","category":"page"},{"location":"reading.html#Examples-2","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Column names on second row\nNo column names in the data\nManually provide column names\nMulti-row column names","category":"page"},{"location":"reading.html#normalizenames","page":"Reading","title":"normalizenames","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Controls whether column names will be \"normalized\" to valid Julia identifiers. By default, this is false. If normalizenames=true, then column names with spaces, or that start with numbers, will be adjusted with underscores to become valid Julia identifiers. This is useful when you want to access columns via dot-access or getproperty, like file.col1. The identifier that comes after the . must be valid, so spaces or identifiers starting with numbers aren't allowed.","category":"page"},{"location":"reading.html#Examples-3","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Normalizing column names","category":"page"},{"location":"reading.html#skipto","page":"Reading","title":"skipto","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An Integer can be provided that specifies the row number where the data is located. By default, the row immediately following the header row is assumed to be the start of data. If header=false, or column names are provided manually as Vector{String} or Vector{Symbol}, the data is assumed to start on row 1, i.e. 
skipto=1.","category":"page"},{"location":"reading.html#Examples-4","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Skip to specific row where data starts","category":"page"},{"location":"reading.html#footerskip","page":"Reading","title":"footerskip","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An Integer argument specifying the number of rows to ignore at the end of a file. This works by the parser starting at the end of the file and parsing in reverse until footerskip # of rows have been parsed, then parsing the entire file, stopping at the newly adjusted \"end of file\".","category":"page"},{"location":"reading.html#Examples-5","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Skipping trailing useless rows","category":"page"},{"location":"reading.html#transpose","page":"Reading","title":"transpose","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"If transpose=true is passed, data will be read \"transposed\", so each row will be parsed as a column, and each column in the data will be returned as a row. Useful when data is extremely wide (many columns), but you want to process it in a \"long\" format (many rows). Note that multithreaded parsing is not supported when parsing is transposed.","category":"page"},{"location":"reading.html#Examples-6","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Reading transposed data","category":"page"},{"location":"reading.html#comment","page":"Reading","title":"comment","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A String argument that, when encountered at the start of a row while parsing, will cause the row to be skipped. When providing header, skipto, or footerskip arguments, it should be noted that commented rows, while ignored, still count as \"rows\" when skipping to a specific row. In this way, you can visually identify, for example, that column names are on row 6, and pass header=6, even if row 5 is a commented row and will be ignored.","category":"page"},{"location":"reading.html#Examples-7","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Ignoring commented rows","category":"page"},{"location":"reading.html#ignoreemptyrows","page":"Reading","title":"ignoreemptyrows","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"This argument specifies whether \"empty rows\", where consecutive newlines are parsed, should be ignored or not. By default, they are. If ignoreemptyrows=false, then for an empty row, all existing columns will have missing assigned to their value for that row. 
Similar to commented rows, empty rows also still count as \"rows\" when any of the header, skipto, or footerskip arguments are provided.","category":"page"},{"location":"reading.html#Examples-8","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Ignoring empty rows","category":"page"},{"location":"reading.html#select","page":"Reading","title":"select / drop","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Arguments that control which columns from the input data will actually be parsed and available after processing. select controls which columns will be accessible after parsing while drop controls which columns to ignore. Either argument can be provided as a vector of Integer, String, or Symbol, specifying the column numbers or names to include/exclude. A vector of Bool matching the number of columns in the input data can also be provided, where each element specifies whether the corresponding column should be included/excluded. Finally, these arguments can also be given as boolean functions, of the form (i, name) -> Bool, where each column number and name will be given as arguments and the result of the function will determine if the column will be included/excluded.","category":"page"},{"location":"reading.html#Examples-9","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Including/excluding columns","category":"page"},{"location":"reading.html#limit","page":"Reading","title":"limit","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An Integer argument to specify the number of rows that should be read from the data. Can be used in conjunction with skipto to read contiguous chunks of a file. Note that with multithreaded parsing (when the data is deemed large enough), it can be difficult for parsing to determine the exact # of rows to limit to, so it may or may not return exactly limit number of rows. To ensure an exact limit on larger files, also pass ntasks=1 to force single-threaded parsing.","category":"page"},{"location":"reading.html#Examples-10","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Limiting number of rows from data","category":"page"},{"location":"reading.html#ntasks","page":"Reading","title":"ntasks","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"NOTE: not applicable to CSV.Rows","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"For large enough data inputs, ntasks controls the number of multithreaded tasks used to concurrently parse the data. By default, it uses Threads.nthreads(), which is the number of threads the julia process was started with, either via julia -t N or the JULIA_NUM_THREADS environment variable. To avoid multithreaded parsing, even on large files, pass ntasks=1. This argument is only applicable to CSV.File, not CSV.Rows. 
For CSV.Chunks, it controls the total number of chunk iterations a large file will be split up into for parsing.","category":"page"},{"location":"reading.html#rows_to_check","page":"Reading","title":"rows_to_check","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"NOTE: not applicable to CSV.Rows","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"When input data is large enough, parsing will attempt to \"chunk\" up the data for multithreaded tasks to parse concurrently. To chunk up the data, it is split up into even chunks, then initial parsers attempt to identify the correct start of the first row of that chunk. Once the start of the chunk's first row is found, each parser will check rows_to_check number of rows to ensure the expected number of columns are present.","category":"page"},{"location":"reading.html#source","page":"Reading","title":"source","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"NOTE: only applicable to vector of inputs passed to CSV.File","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values of the column will be the input \"name\" (usually file name) of the input from whence the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that inputs values in the auto-added column.","category":"page"},{"location":"reading.html#missingstring","page":"Reading","title":"missingstring","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Argument to control how missing values are handled while parsing input data. The default is missingstring=\"\", which means two consecutive delimiters, like ,,, will result in a cell being set as a missing value. Otherwise, you can pass a single string to use as a \"sentinel\", like missingstring=\"NA\", or a vector of strings, where each will be checked for when parsing, like missingstring=[\"NA\", \"NAN\", \"NULL\"], and if any match, the cell will be set to missing. By passing missingstring=nothing, no missing values will be checked for while parsing.","category":"page"},{"location":"reading.html#Examples-11","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Specifying custom missing strings","category":"page"},{"location":"reading.html#delim","page":"Reading","title":"delim","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A Char or String argument that parsing looks for in the data input that separates distinct columns on each row. 
If no argument is provided (the default), parsing will try to detect the most consistent delimiter on the first 10 rows of the input, falling back to a single comma (,) if no other delimiter can be detected consistently.","category":"page"},{"location":"reading.html#Examples-12","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"String delimiter","category":"page"},{"location":"reading.html#ignorerepeated","page":"Reading","title":"ignorerepeated","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A Bool argument, default false, that, if set to true, will cause parsing to ignore any number of consecutive delimiters between columns. This option can often be used to accurately parse fixed-width data inputs, where columns are delimited with a fixed number of delimiters, or a row is fixed-width and columns may have a variable number of delimiters between them based on the length of cell values.","category":"page"},{"location":"reading.html#Examples-13","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Fixed width files","category":"page"},{"location":"reading.html#quoted","page":"Reading","title":"quoted","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A Bool argument that controls whether parsing will check for opening/closing quote characters at the start/end of cells. Default true. If you happen to know a file has no quoted cells, it can simplify parsing to pass quoted=false, so parsing avoids treating the quotechar or openquotechar/closequotechar arguments specially.","category":"page"},{"location":"reading.html#Examples-14","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Turning off quoted cell parsing","category":"page"},{"location":"reading.html#quotechar","page":"Reading","title":"quotechar / openquotechar / closequotechar","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An ASCII Char argument (or arguments if both openquotechar and closequotechar are provided) that parsing uses to handle \"quoted\" cells. If a cell string value contains the delim argument, or a newline, it should start and end with quotechar, or start with openquotechar and end with closequotechar so parsing knows to treat the delim or newline as part of the cell value instead of as significant parsing characters. If the quotechar or closequotechar characters also need to appear in the cell value, they should be properly escaped via the escapechar argument.","category":"page"},{"location":"reading.html#Examples-15","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Quoted & escaped fields","category":"page"},{"location":"reading.html#escapechar","page":"Reading","title":"escapechar","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An ASCII Char argument that parsing uses when parsing quoted cells and the quotechar or closequotechar characters appear in a cell string value. 
If the escapechar character is encountered inside a quoted cell, it will be \"skipped\", and the following character will not be checked for parsing significance, but just treated as another character in the value of the cell. Note the escapechar is not included in the value of the cell, but is ignored completely.","category":"page"},{"location":"reading.html#dateformat","page":"Reading","title":"dateformat","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A String or AbstractDict argument that controls how parsing detects datetime values in the data input. As a single String (or DateFormat) argument, the same format will be applied to all columns in the file. For columns without type information provided otherwise, parsing will use the provided format string to check if the cell is parseable and if so, will attempt to parse the entire column as the datetime type (Time, Date, or DateTime). By default, if no dateformat argument is explicitly provided, parsing will try to detect any of Time, Date, or DateTime types following the standard Dates.ISOTimeFormat, Dates.ISODateFormat, or Dates.ISODateTimeFormat formats, respectively. If a datetime type is provided for a column (see the types argument), then the dateformat format string needs to match the format of values in that column, otherwise, a warning will be emitted and the value will be replaced with a missing value (this behavior is also configurable via the strict and silencewarnings arguments). If an AbstractDict is provided, different dateformat strings can be provided for specific columns; the provided dict can map either an Integer for column number or a String, Symbol or Regex for column name to the dateformat string that should be used for that column. Columns not mapped in the dict argument will use the default format strings mentioned above.","category":"page"},{"location":"reading.html#Examples-16","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"DateFormat","category":"page"},{"location":"reading.html#decimal","page":"Reading","title":"decimal","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An ASCII Char argument that is used when parsing float values that indicates where the fractional portion of the float value begins. i.e. for the truncated value of pi, 3.14, the '.' character separates the 3 and 14 values, whereas for 3,14 (common European notation), the ',' character separates the fractional portion. By default, decimal='.'.","category":"page"},{"location":"reading.html#Examples-17","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Custom decimal separator","category":"page"},{"location":"reading.html#groupmark","page":"Reading","title":"groupmark / thousands separator","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A \"groupmark\" is a symbol that separates groups of digits so that it is easier for humans to read a number. Thousands separators are a common example of groupmarks. The argument groupmark, if provided, must be an ASCII Char which will be ignored during parsing when it occurs between two digits on the left hand side of the decimal. e.g. the groupmark in the integer 1,729 is ',' and the groupmark for the US social security number 875-39-3196 is -. 
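For example, a sketch (hypothetical data) of parsing European-style numbers with decimal=',' and a '.' groupmark:

using CSV

# hypothetical semicolon-delimited data: '.' groups thousands, ',' marks the decimal
data = "price;qty\n1.234,56;2\n999,5;10\n"
file = CSV.File(IOBuffer(data); delim=';', decimal=',', groupmark='.')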
By default, groupmark=nothing, which indicates that there are no stray characters separating digits.","category":"page"},{"location":"reading.html#Examples-18","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Thousands separator\nCustom groupmarks","category":"page"},{"location":"reading.html#truestrings","page":"Reading","title":"truestrings / falsestrings","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"These arguments can be provided as Vector{String} to specify custom values that should be treated as the Bool true/false values for all the columns of a data input. By default, [\"true\", \"True\", \"TRUE\", \"T\", \"1\"] string values are used to detect true values, and [\"false\", \"False\", \"FALSE\", \"F\", \"0\"] string values are used to detect false values. Note that even though \"1\" and \"0\" can be used to parse true/false values, in terms of auto-detecting column types, those values will be parsed as Int64 first, instead of Bool. To instead parse those values as Bools for a column, you can manually provide that column's type as Bool (see the types argument).","category":"page"},{"location":"reading.html#Examples-19","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Custom bool strings","category":"page"},{"location":"reading.html#types","page":"Reading","title":"types","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Argument to control the types of columns that get parsed in the data input. Can be provided as a single Type, an AbstractVector of types, an AbstractDict, or a function.","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"If a single type is provided, like types=Float64, then all columns in the data input will be parsed as Float64. If a column's value isn't a valid Float64 value, then a warning will be emitted, unless silencewarnings=true is passed, in which case no warning will be printed. However, if strict=true is passed, then an error will be thrown instead, regardless of the silencewarnings argument.\nIf an AbstractVector{Type} is provided, then the length of the vector should match the number of columns in the data input, and each element gives the type of the corresponding column in order.\nIf an AbstractDict, then specific columns can have their column type specified with the key of the dict being an Integer for column number, or String or Symbol for column name or Regex matching column names, and the dict value being the column type. Unspecified columns will have their column type auto-detected while parsing.\nIf a function, then it should be of the form (i, name) -> Union{T, Nothing}, and will be applied to each detected column during initial parsing. Returning nothing from the function will result in the column's type being automatically detected during parsing.","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"By default, types=nothing, which means all column types in the data input will be detected while parsing. Note that it isn't necessary to pass types=Union{Float64, Missing} if the data input contains missing values. Parsing will detect missing values if present, and promote any manually provided column types from the singular (Float64) to the missing equivalent (Union{Float64, Missing}) automatically. 
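As a quick sketch (hypothetical file and column names), a few of the forms the types argument can take:

using CSV, Dates

# hypothetical: parse column 1 as Int64 and the "date" column as Date; others are auto-detected
file = CSV.File("data.csv"; types=Dict(1 => Int64, "date" => Date))

# or as a function: force any column whose name starts with "id" to String
file = CSV.File("data.csv"; types=(i, name) -> startswith(String(name), "id") ? String : nothing)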
Standard types will be auto-detected in the following order when not otherwise specified: Int64, Float64, Date, DateTime, Time, Bool, String.","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Non-standard types can be provided, like Dec64 from the DecFP.jl package, but must support the Base.tryparse(T, str) function for parsing a value from a string. This allows, for example, easily defining a custom type, like struct Float64Array; values::Vector{Float64}; end, as long as a corresponding Base.tryparse definition is defined, like Base.tryparse(::Type{Float64Array}, str) = Float64Array(map(x -> parse(Float64, x), split(str, ';'))), where a single cell in the data input is like 1.23;4.56;7.89.","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Note that the default stringtype can be overridden by providing a column's type manually, like CSV.File(source; types=Dict(1 => String), stringtype=PosLenString), where the first column will be parsed as a String, while any other string columns will have the PosLenString type.","category":"page"},{"location":"reading.html#Examples-20","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Matrix-like Data\nProviding types","category":"page"},{"location":"reading.html#typemap","page":"Reading","title":"typemap","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An AbstractDict{Type, Type} argument that allows replacing a non-String standard type with another type when a column's type is auto-detected. Most commonly, this would be used to force all numeric columns to be Float64, like typemap=IdDict(Int64 => Float64), which would cause any columns detected as Int64 to be parsed as Float64 instead. Another common case would be wanting all columns of a specific type to be parsed as strings instead, like typemap=IdDict(Date => String), which will cause any columns detected as Date to be parsed as String instead.","category":"page"},{"location":"reading.html#Examples-21","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Typemap","category":"page"},{"location":"reading.html#pool","page":"Reading","title":"pool","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Argument that controls whether columns will be returned as PooledArrays. Can be provided as a Bool, Float64, Tuple{Float64, Int}, vector, dict, or a function of the form (i, name) -> Union{Bool, Real, Tuple{Float64, Int}, Nothing}. As a Bool, controls absolutely whether a column will be pooled or not; if passed as a single Bool argument like pool=true, then all string columns will be pooled, regardless of cardinality. When passed as a Float64, the value should be between 0.0 and 1.0 to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if pool=0.1, then all string columns with a unique value % less than 10% will be returned as PooledArray, while other string columns will be normal string vectors. If pool is provided as a tuple, like (0.2, 500), the first tuple element is the same as a single Float64 value, which represents the % cardinality allowed. The second tuple element is an upper limit on the # of unique values allowed to pool the column. 
So, for example, pool=(0.2, 500) means that if a String column has less than or equal to 500 unique values and the # of unique values is less than 20% of the total # of values, it will be pooled; otherwise, it won't. As mentioned, when the pool argument is a single Bool, Real, or Tuple{Float64, Int}, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a Bool, Float64, or Tuple{Float64, Int}. Similar to the types argument, providing a vector to pool should have an element for each column in the data input, while a dict argument can map column number/name to Bool, Float64, or Tuple{Float64, Int} for specific columns. Unspecified columns will not be pooled when the argument is a dict.","category":"page"},{"location":"reading.html#Examples-22","page":"Reading","title":"Examples","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Pooled values\nNon-string column pooling\nPool with absolute threshold","category":"page"},{"location":"reading.html#downcast","page":"Reading","title":"downcast","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A Bool argument that controls whether detected Integer column types will be \"shrunk\" to the smallest possible integer type. The argument is false by default. Only applies to auto-detected column types; i.e. if a column type is provided manually as Int64, it will not be shrunk. Useful for shrinking the overall memory footprint of parsed data, though care should be taken when processing the results as Julia by default has integer overflow behavior, which is increasingly likely the smaller the integer type.","category":"page"},{"location":"reading.html#stringtype","page":"Reading","title":"stringtype","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"An argument that controls the precise type of string columns. Supported values are InlineString (the default), PosLenString, or String. The various string types are aimed at being mostly transparent to most users. In certain workflows, however, it can be advantageous to be more specific. Here's a quick rundown of the possible options:","category":"page"},{"location":"reading.html","page":"Reading","title":"Reading","text":"InlineString: a set of fixed-width, stack-allocated primitive types. Can take memory pressure off the GC because they aren't reference types/on the heap. For very large files with string columns that have a fairly low variance in string length, this can provide much better GC interaction than String. When string length has a high variance, it can lead to lots of \"wasted space\", since an entire column will be promoted to the smallest InlineString type that fits the longest string value. For small strings, that can mean a lot of wasted space when they're promoted to a high fixed-width.\nPosLenString: results in columns returned as PosLenStringVector (or ChainedVector{PosLenStringVector} for the multithreaded case), which holds a reference to the original input data, and acts as one large \"view\" vector into the original data where each cell begins/ends. Can result in the smallest memory footprint for string columns. PosLenStringVector, however, does not support the traditional mutating operations of regular Vectors, like push!, append!, or deleteat!.\nString: each string must be heap-allocated, which can result in higher GC pressure in very large files. 
But columns are returned as normal Vector{String} (or ChainedVector{Vector{String}}), which can be processed normally, including any mutating operations.","category":"page"},{"location":"reading.html#strict","page":"Reading","title":"strict / silencewarnings / maxwarnings","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Arguments that control error behavior when invalid values are encountered while parsing. Only applicable when types are provided manually by the user via the types argument. If a column type is manually provided, but an invalid value is encountered, the default behavior is to set the value for that cell to missing, emit a warning (i.e. silencewarnings=false and strict=false), but only up to 100 total warnings and then they'll be silenced (i.e. maxwarnings=100). If strict=true, then invalid values will result in an error being thrown instead of any warnings emitted.","category":"page"},{"location":"reading.html#debug","page":"Reading","title":"debug","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"A Bool argument that controls the printing of extra \"debug\" information while parsing. Can be useful if parsing doesn't produce the expected result or a bug is suspected in parsing somehow.","category":"page"},{"location":"reading.html#API-Reference","page":"Reading","title":"API Reference","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"CSV.read\nCSV.File\nCSV.Chunks\nCSV.Rows","category":"page"},{"location":"reading.html#CSV.read","page":"Reading","title":"CSV.read","text":"CSV.read(source, sink::T; kwargs...) => T\n\nRead and parses a delimited file or files, materializing directly using the sink function. Allows avoiding excessive copies of columns for certain sinks like DataFrame.\n\nExample\n\njulia> using CSV, DataFrames\n\njulia> path = tempname();\n\njulia> write(path, \"a,b,c\\n1,2,3\");\n\njulia> CSV.read(path, DataFrame)\n1×3 DataFrame\n Row │ a b c\n │ Int64 Int64 Int64\n─────┼─────────────────────\n 1 │ 1 2 3\n\njulia> CSV.read(path, DataFrame; header=false)\n2×3 DataFrame\n Row │ Column1 Column2 Column3\n │ String1 String1 String1\n─────┼───────────────────────────\n 1 │ a b c\n 2 │ 1 2 3\n\nArguments\n\nFile layout options:\n\nheader=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.\nnormalizenames::Bool=false: whether column names should be \"normalized\" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)\nskipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. 
If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.\nfooterskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser\ntranspose::Bool: read a csv file \"transposed\", i.e. each column is parsed as a row\ncomment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.\nignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)\nselect: an AbstractVector of Integer, Symbol, String, or Bool, or a \"selector\" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.\ndrop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a \"drop\" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will ignored in the resulting CSV.File. Invalid values in drop are ignored.\nlimit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary\nbuffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.\nntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)\nrows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, lines_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows\nsource: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values of the column will be the input \"name\" (usually file name) of the input from whence the value was parsed. 
As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that inputs values in the auto-added column.\n\nParsing options:\n\nmissingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring=\"\", which means only an empty field (two consecutive delimiters) is considered missing\ndelim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file\nignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells\nquoted::Bool=true: whether parsing should check for quotechar at the start/end of cells\nquotechar='\"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters\nescapechar='\"': the Char used to escape quote characters in a quoted field\ndateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.\ndecimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','\ngroupmark=nothing: optionally specify a single-byte character denoting the number grouping mark, this allows parsing of numbers that have, e.g., thousand separators (1,000.00).\ntruestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default \"true\", \"True\", \"TRUE\", \"T\", \"1\" are used to detect true and \"false\", \"False\", \"FALSE\", \"F\", \"0\" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument\nstripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names\n\nColumn Type Options:\n\ntypes: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64 and, Dict(\"column1\"=>Float64) will set the column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.\ntypemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. 
Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only \"standard\" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is \"detected\", it will be mapped to the specified type.\npool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.\ndowncast::Bool=false: controls whether columns detected as Int64 will be \"downcast\" to the smallest possible integer type like Int8, Int16, Int32, etc.\nstringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special \"lazy\" AbstractVector that acts as a \"view\" into the original file data. This can lead to the most efficient parsing times, but note that the \"view\" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail\nstrict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing\nsilencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced\nmaxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings\ndebug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed\nvalidate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. 
If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.\n\nIteration options:\n\nreusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only current iterated row is \"valid\")\n\n\n\n\n\n","category":"function"},{"location":"reading.html#CSV.File","page":"Reading","title":"CSV.File","text":"CSV.File(input; kwargs...) => CSV.File\n\nRead a UTF-8 CSV input and return a CSV.File object, which is like a lightweight table/dataframe, allowing dot-access to columns and iterating rows. Satisfies the Tables.jl interface, so can be passed to any valid sink, yet to avoid unnecessary copies of data, use CSV.read(input, sink; kwargs...) instead if the CSV.File intermediate object isn't needed.\n\nThe input argument can be one of:\n\nfilename given as a string or FilePaths.jl type\na Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer\na CodeUnits object, which wraps a String, like codeunits(str)\na csv-formatted string can also be passed like IOBuffer(str)\na Cmd or other IO\na gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing\na Vector of any of the above, which will parse and vertically concatenate each source, returning a single, \"long\" CSV.File\n\nTo read a csv file from a url, use the Downloads.jl stdlib or HTTP.jl package, where the resulting downloaded tempfile or HTTP.Response body can be passed like:\n\nusing Downloads, CSV\nf = CSV.File(Downloads.download(url))\n\n# or\n\nusing HTTP, CSV\nf = CSV.File(HTTP.get(url).body)\n\nOpens the file or files and uses passed arguments to detect the number of columns and column types, unless column types are provided manually via the types keyword argument. Note that passing column types manually can slightly increase performance for each column type provided (column types can be given as a Vector for all columns, or specified per column via name or index in a Dict).\n\nWhen a Vector of inputs is provided, the column names and types of each separate file/input must match to be vertically concatenated. Separate threads will be used to parse each input, which will each parse their input using just the single thread. The results of all threads are then vertically concatenated using ChainedVectors to lazily concatenate each thread's columns.\n\nFor text encodings other than UTF-8, load the StringEncodings.jl package and call e.g. CSV.File(open(read, input, enc\"ISO-8859-1\")).\n\nThe returned CSV.File object supports the Tables.jl interface and can iterate CSV.Rows. CSV.Row supports propertynames and getproperty to access individual row values. CSV.File also supports entire column access like a DataFrame via direct property access on the file object, like f = CSV.File(file); f.col1. Or by getindex access with column names, like f[:col1] or f[\"col1\"]. The returned columns are AbstractArray subtypes, including: SentinelVector (for integers), regular Vector, PooledVector for pooled columns, MissingVector for columns of all missing values, PosLenStringVector when stringtype=PosLenString is passed, and ChainedVector will chain one of the previous array types together for data inputs that use multiple threads to parse (each thread parses a single \"chain\" of the input). 
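For instance, a minimal sketch (hypothetical file names) of parsing a vector of inputs while recording each row's origin via the source keyword:

using CSV

# hypothetical: vertically concatenate two files, adding a "filename" column with each row's source
file = CSV.File(["jan.csv", "feb.csv"]; source="filename")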
Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:\n\nfor row in CSV.File(file)\n println(\"a=$(row.a), b=$(row.b), c=$(row.c)\")\nend\n\nBy supporting the Tables.jl interface, a CSV.File can also be a table input to any other table sink function. Like:\n\n# materialize a csv file as a DataFrame, copying columns from CSV.File\ndf = CSV.File(file) |> DataFrame\n\n# to avoid making a copy of parsed columns, use CSV.read\ndf = CSV.read(file, DataFrame)\n\n# load a csv file directly into an sqlite database table\ndb = SQLite.DB()\ntbl = CSV.File(file) |> SQLite.load!(db, \"sqlite_table\")\n\nArguments\n\nFile layout options:\n\nheader=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.\nnormalizenames::Bool=false: whether column names should be \"normalized\" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)\nskipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.\nfooterskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser\ntranspose::Bool: read a csv file \"transposed\", i.e. each column is parsed as a row\ncomment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.\nignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)\nselect: an AbstractVector of Integer, Symbol, String, or Bool, or a \"selector\" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. 
Invalid values in select are ignored.\ndrop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a \"drop\" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will ignored in the resulting CSV.File. Invalid values in drop are ignored.\nlimit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary\nbuffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.\nntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)\nrows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, lines_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows\nsource: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values of the column will be the input \"name\" (usually file name) of the input from whence the value was parsed. 
As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that inputs values in the auto-added column.\n\nParsing options:\n\nmissingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring=\"\", which means only an empty field (two consecutive delimiters) is considered missing\ndelim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file\nignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells\nquoted::Bool=true: whether parsing should check for quotechar at the start/end of cells\nquotechar='\"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters\nescapechar='\"': the Char used to escape quote characters in a quoted field\ndateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.\ndecimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','\ngroupmark=nothing: optionally specify a single-byte character denoting the number grouping mark, this allows parsing of numbers that have, e.g., thousand separators (1,000.00).\ntruestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default \"true\", \"True\", \"TRUE\", \"T\", \"1\" are used to detect true and \"false\", \"False\", \"FALSE\", \"F\", \"0\" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument\nstripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names\n\nColumn Type Options:\n\ntypes: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64 and, Dict(\"column1\"=>Float64) will set the column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.\ntypemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. 
Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only \"standard\" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is \"detected\", it will be mapped to the specified type.\npool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.\ndowncast::Bool=false: controls whether columns detected as Int64 will be \"downcast\" to the smallest possible integer type like Int8, Int16, Int32, etc.\nstringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special \"lazy\" AbstractVector that acts as a \"view\" into the original file data. This can lead to the most efficient parsing times, but note that the \"view\" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail\nstrict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing\nsilencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced\nmaxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings\ndebug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed\nvalidate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. 
If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.\n\nIteration options:\n\nreusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only current iterated row is \"valid\")\n\n\n\n\n\n","category":"type"},{"location":"reading.html#CSV.Chunks","page":"Reading","title":"CSV.Chunks","text":"CSV.Chunks(source; ntasks::Integer=Threads.nthreads(), kwargs...) => CSV.Chunks\n\nReturns a file \"chunk\" iterator. Accepts all the same inputs and keyword arguments as CSV.File, see those docs for explanations of each keyword argument.\n\nThe ntasks keyword argument specifies how many chunks a file should be split up into, defaulting to the # of threads available to Julia (i.e. JULIA_NUM_THREADS environment variable) or 8 if Julia is run single-threaded.\n\nEach iteration of CSV.Chunks produces the next chunk of a file as a CSV.File. While initial file metadata detection is done only once (to determine # of columns, column names, etc), each iteration does independent type inference on columns. This is significant as different chunks may end up with different column types than previous chunks as new values are encountered in the file. Note that, as with CSV.File, types may be passed manually via the type or types keyword arguments.\n\nThis functionality is new and thus considered experimental; please open an issue if you run into any problems/bugs.\n\nArguments\n\nFile layout options:\n\nheader=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.\nnormalizenames::Bool=false: whether column names should be \"normalized\" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)\nskipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.\nfooterskip::Integer: number of rows at the end of a file to skip parsing. 
Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser\ntranspose::Bool: read a csv file \"transposed\", i.e. each column is parsed as a row\ncomment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.\nignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)\nselect: an AbstractVector of Integer, Symbol, String, or Bool, or a \"selector\" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.\ndrop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a \"drop\" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will ignored in the resulting CSV.File. Invalid values in drop are ignored.\nlimit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary\nbuffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.\nntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)\nrows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, lines_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows\nsource: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values of the column will be the input \"name\" (usually file name) of the input from whence the value was parsed. 
As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that inputs values in the auto-added column.\n\nParsing options:\n\nmissingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring=\"\", which means only an empty field (two consecutive delimiters) is considered missing\ndelim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file\nignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells\nquoted::Bool=true: whether parsing should check for quotechar at the start/end of cells\nquotechar='\"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters\nescapechar='\"': the Char used to escape quote characters in a quoted field\ndateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.\ndecimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','\ngroupmark=nothing: optionally specify a single-byte character denoting the number grouping mark, this allows parsing of numbers that have, e.g., thousand separators (1,000.00).\ntruestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default \"true\", \"True\", \"TRUE\", \"T\", \"1\" are used to detect true and \"false\", \"False\", \"FALSE\", \"F\", \"0\" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument\nstripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names\n\nColumn Type Options:\n\ntypes: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64 and, Dict(\"column1\"=>Float64) will set the column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.\ntypemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. 
Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only \"standard\" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is \"detected\", it will be mapped to the specified type.\npool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.\ndowncast::Bool=false: controls whether columns detected as Int64 will be \"downcast\" to the smallest possible integer type like Int8, Int16, Int32, etc.\nstringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special \"lazy\" AbstractVector that acts as a \"view\" into the original file data. This can lead to the most efficient parsing times, but note that the \"view\" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail\nstrict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing\nsilencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced\nmaxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings\ndebug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed\nvalidate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. 
If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.\n\nIteration options:\n\nreusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only current iterated row is \"valid\")\n\n\n\n\n\n","category":"type"},{"location":"reading.html#CSV.Rows","page":"Reading","title":"CSV.Rows","text":"CSV.Rows(source; kwargs...) => CSV.Rows\n\nRead a csv input returning a CSV.Rows object.\n\nThe input argument can be one of:\n\nfilename given as a string or FilePaths.jl type\na Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer\na CodeUnits object, which wraps a String, like codeunits(str)\na csv-formatted string can also be passed like IOBuffer(str)\na Cmd or other IO\na gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing\n\nTo read a csv file from a url, use the HTTP.jl package, where the HTTP.Response body can be passed like:\n\nf = CSV.Rows(HTTP.get(url).body)\n\nFor other IO or Cmd inputs, you can pass them like: f = CSV.Rows(read(obj)).\n\nWhile similar to CSV.File, CSV.Rows provides a slightly different interface, the tradeoffs including:\n\nVery minimal memory footprint; while iterating, only the current row values are buffered\nOnly provides row access via iteration; to access columns, one can stream the rows into a table type\nPerforms no type inference; each column/cell is essentially treated as Union{String, Missing}, users can utilize the performant Parsers.parse(T, str) to convert values to a more specific type if needed, or pass types upon construction using the type or types keyword arguments\n\nOpens the file and uses passed arguments to detect the number of columns, ***but not*** column types (column types default to String unless otherwise manually provided). The returned CSV.Rows object supports the Tables.jl interface and can iterate rows. Each row object supports propertynames, getproperty, and getindex to access individual row values. Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:\n\nfor row in CSV.Rows(file)\n println(\"a=$(row.a), b=$(row.b), c=$(row.c)\")\nend\n\nArguments\n\nFile layout options:\n\nheader=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). 
Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.\nnormalizenames::Bool=false: whether column names should be \"normalized\" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)\nskipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.\nfooterskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip; they are completely ignored by the parser\ntranspose::Bool: read a csv file \"transposed\", i.e. each column is parsed as a row\ncomment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.\nignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)\nselect: an AbstractVector of Integer, Symbol, String, or Bool, or a \"selector\" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.\ndrop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a \"drop\" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.\nlimit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary\nbuffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.\nntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)\nrows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows\nsource: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input \"name\" (usually file name) of the input from whence the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.\n\nParsing options:\n\nmissingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring=\"\", which means only an empty field (two consecutive delimiters) is considered missing\ndelim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file\nignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells\nquoted::Bool=true: whether parsing should check for quotechar at the start/end of cells\nquotechar='\"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters\nescapechar='\"': the Char used to escape quote characters in a quoted field\ndateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.\ndecimal='.': a Char indicating how decimals are separated in floats, i.e. 
3.14 uses '.', or 3,14 uses a comma ','\ngroupmark=nothing: optionally specify a single-byte character denoting the number grouping mark, this allows parsing of numbers that have, e.g., thousand separators (1,000.00).\ntruestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default \"true\", \"True\", \"TRUE\", \"T\", \"1\" are used to detect true and \"false\", \"False\", \"FALSE\", \"F\", \"0\" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument\nstripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names\n\nColumn Type Options:\n\ntypes: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64 and, Dict(\"column1\"=>Float64) will set the column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.\ntypemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only \"standard\" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is \"detected\", it will be mapped to the specified type.\npool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. 
If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.\ndowncast::Bool=false: controls whether columns detected as Int64 will be \"downcast\" to the smallest possible integer type like Int8, Int16, Int32, etc.\nstringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special \"lazy\" AbstractVector that acts as a \"view\" into the original file data. This can lead to the most efficient parsing times, but note that the \"view\" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail\nstrict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing\nsilencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced\nmaxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings\ndebug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed\nvalidate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.\n\nIteration options:\n\nreusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only current iterated row is \"valid\")\n\n\n\n\n\n","category":"type"},{"location":"reading.html#Utilities","page":"Reading","title":"Utilities","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"CSV.detect","category":"page"},{"location":"reading.html#CSV.detect","page":"Reading","title":"CSV.detect","text":"CSV.detect(str::String)\n\nUse the same logic used by CSV.File to detect column types, to parse a value from a plain string. This can be useful in conjunction with the CSV.Rows type, which returns each cell of a file as a String. The order of types attempted is: Int, Float64, Date, DateTime, Bool, and if all fail, the input String is returned. No errors are thrown. 
For advanced usage, you can pass your own Parsers.Options type as a keyword argument option=ops for sentinel value detection.\n\n\n\n\n\n","category":"function"},{"location":"reading.html#Common-terms","page":"Reading","title":"Common terms","text":"","category":"section"},{"location":"reading.html#Standard-types","page":"Reading","title":"Standard types","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"The types that are detected by default when column types are not provided by the user otherwise. They include: Int64, Float64, Date, DateTime, Time, Bool, and String.","category":"page"},{"location":"reading.html#newlines","page":"Reading","title":"Newlines","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"For all parsing functionality, newlines are detected/parsed automatically, regardless if they're present in the data as a single newline character ('\\n'), single return character ('\\r'), or full CRLF sequence (\"\\r\\n\").","category":"page"},{"location":"reading.html#Cardinality","page":"Reading","title":"Cardinality","text":"","category":"section"},{"location":"reading.html","page":"Reading","title":"Reading","text":"Refers to the ratio of unique values to total number of values in a column. Columns with \"low cardinality\" have a low % of unique values, or put another way, there are only a few unique values for the entire column of data where unique values are repeated many times. Columns with \"high cardinality\" have a high % of unique values relative to total number of values. Think of these as \"id-like\" columns where each or almost each value is a unique identifier with no (or few) repeated values.","category":"page"},{"location":"writing.html#Writing","page":"Writing","title":"Writing","text":"","category":"section"},{"location":"writing.html","page":"Writing","title":"Writing","text":"CSV.write\nCSV.RowWriter","category":"page"},{"location":"writing.html#CSV.write","page":"Writing","title":"CSV.write","text":"CSV.write(file, table; kwargs...) => file\ntable |> CSV.write(file; kwargs...) => file\n\nWrite a Tables.jl interface input to a csv file, given as an IO argument or String/FilePaths.jl type representing the file name to write to. 
Alternatively, CSV.RowWriter creates a row iterator, producing a csv-formatted string for each row in an input table.\n\nSupported keyword arguments include:\n\nbufsize::Int=2^22: The length of the buffer to use when writing each csv-formatted row; default 4MB; if a row is larger than the bufsize an error is thrown\ndelim::Union{Char, String}=',': a character or string to print out as the file's delimiter\nquotechar::Char='\"': ascii character to use for quoting text fields that may contain delimiters or newlines\nopenquotechar::Char: instead of quotechar, use openquotechar and closequotechar to support different starting and ending quote characters\nescapechar::Char='\"': ascii character used to escape quote characters in a text field\nmissingstring::String=\"\": string to print for missing values\ndateformat=Dates.default_format(T): the date format string to use for printing out Date & DateTime columns\nappend=false: whether to append writing to an existing file/IO, if true, it will not write column names by default\ncompress=false: compress the written output using standard gzip compression (provided by the CodecZlib.jl package); note that a compression stream can always be provided as the first \"file\" argument to support other forms of compression; passing compress=true is just for convenience to avoid needing to manually setup a GzipCompressorStream\nwriteheader=!append: whether to write an initial row of delimited column names, not written by default if appending\nheader: pass a list of column names (Symbols or Strings) to use instead of the column names of the input table\nnewline='\\n': character or string to use to separate rows (lines in the csv file)\nquotestrings=false: whether to force all strings to be quoted or not\ndecimal='.': character to use as the decimal point when writing floating point numbers\ntransform=(col,val)->val: a function that is applied to every cell e.g. 
we can transform all nothing values to missing using (col, val) -> something(val, missing)\nbom=false: whether to write a UTF-8 BOM header (0xEF 0xBB 0xBF) or not\npartition::Bool=false: by passing true, the table argument is expected to implement Tables.partitions and the file argument can either be an indexable collection of IO, file Strings, or a single file String that will have an index appended to the name\n\nExamples\n\nusing CSV, Tables, DataFrames\n\n# write out a DataFrame to csv file\ndf = DataFrame(rand(10, 10), :auto)\nCSV.write(\"data.csv\", df)\n\n# write a matrix to an in-memory IOBuffer\nio = IOBuffer()\nmat = rand(10, 10)\nCSV.write(io, Tables.table(mat))\n\n\n\n\n\n","category":"function"},{"location":"writing.html#CSV.RowWriter","page":"Writing","title":"CSV.RowWriter","text":"CSV.RowWriter(table; kwargs...)\n\nCreates an iterator that produces csv-formatted strings for each row in the input table.\n\nSupported keyword arguments include:\n\nbufsize::Int=2^22: The length of the buffer to use when writing each csv-formatted row; default 4MB; if a row is larger than the bufsize an error is thrown\ndelim::Union{Char, String}=',': a character or string to print out as the file's delimiter\nquotechar::Char='\"': ascii character to use for quoting text fields that may contain delimiters or newlines\nopenquotechar::Char: instead of quotechar, use openquotechar and closequotechar to support different starting and ending quote characters\nescapechar::Char='\"': ascii character used to escape quote characters in a text field\nmissingstring::String=\"\": string to print for missing values\ndateformat=Dates.default_format(T): the date format string to use for printing out Date & DateTime columns\nheader: pass a list of column names (Symbols or Strings) to use instead of the column names of the input table\nnewline='\\n': character or string to use to separate rows (lines in the csv file)\nquotestrings=false: whether to force all strings to be quoted or not\ndecimal='.': character to use as the decimal point when writing floating point numbers\ntransform=(col,val)->val: a function that is applied to every cell e.g. we can transform all nothing values to missing using (col, val) -> something(val, missing)\nbom=false: whether to write a UTF-8 BOM header (0xEF 0xBB 0xBF) or not\n\n\n\n\n\n","category":"type"}]
}
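As a concrete illustration of the CSV.Rows and CSV.detect behavior described in the docstrings above, here is a minimal sketch; the column names (a, b) and the sample values are invented for illustration only:

using CSV, Parsers

# CSV.Rows performs no type inference: each cell is treated as Union{String, Missing},
# so values are narrowed on demand with Parsers.parse or CSV.detect
data = IOBuffer("a,b\n1,2.5\n2,3.5\n")
for row in CSV.Rows(data; reusebuffer=true)   # reusebuffer=true is safe here: each row is used only once
    a = Parsers.parse(Int, row.a)             # explicit conversion to a known type
    b = CSV.detect(String(row.b))             # let detection try Int, Float64, Date, DateTime, Bool, then String
    println("a=$(a), b=$(b)")
end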
diff --git a/dev/writing.html b/dev/writing.html
index b02ec633..455ee4d9 100644
--- a/dev/writing.html
+++ b/dev/writing.html
@@ -1,5 +1,5 @@
-Writing · CSV.jl
Write a Tables.jl interface input to a csv file, given as an IO argument or String/FilePaths.jl type representing the file name to write to. Alternatively, CSV.RowWriter creates a row iterator, producing a csv-formatted string for each row in an input table.
Supported keyword arguments include:
bufsize::Int=2^22: The length of the buffer to use when writing each csv-formatted row; default 4MB; if a row is larger than the bufsize an error is thrown
delim::Union{Char, String}=',': a character or string to print out as the file's delimiter
quotechar::Char='"': ascii character to use for quoting text fields that may contain delimiters or newlines
openquotechar::Char: instead of quotechar, use openquotechar and closequotechar to support different starting and ending quote characters
escapechar::Char='"': ascii character used to escape quote characters in a text field
missingstring::String="": string to print for missing values
dateformat=Dates.default_format(T): the date format string to use for printing out Date & DateTime columns
append=false: whether to append writing to an existing file/IO, if true, it will not write column names by default
compress=false: compress the written output using standard gzip compression (provided by the CodecZlib.jl package); note that a compression stream can always be provided as the first "file" argument to support other forms of compression; passing compress=true is just for convenience to avoid needing to manually set up a GzipCompressorStream
writeheader=!append: whether to write an initial row of delimited column names, not written by default if appending
header: pass a list of column names (Symbols or Strings) to use instead of the column names of the input table
newline='\n': character or string to use to separate rows (lines in the csv file)
quotestrings=false: whether to force all strings to be quoted or not
decimal='.': character to use as the decimal point when writing floating point numbers
transform=(col,val)->val: a function that is applied to every cell, e.g. we can transform all nothing values to missing using (col, val) -> something(val, missing)
bom=false: whether to write a UTF-8 BOM header (0xEF 0xBB 0xBF) or not
partition::Bool=false: by passing true, the table argument is expected to implement Tables.partitions and the file argument can either be an indexable collection of IO, file Strings, or a single file String that will have an index appended to the name
Examples
using CSV, Tables, DataFrames
# write out a DataFrame to csv file
@@ -9,4 +9,4 @@
# write a matrix to an in-memory IOBuffer
io = IOBuffer()
mat = rand(10, 10)
-CSV.write(io, Tables.table(mat))
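Building on the example above, the transform, missingstring, and append keywords can be combined when a table contains values (such as nothing) that should be written as missing; a minimal sketch, with a hypothetical output file name and made-up columns:

using CSV, DataFrames

df = DataFrame(a=[1, 2, 3], b=["x", nothing, "z"])

# map nothing cells to missing so they are printed using missingstring
CSV.write("out.csv", df; transform=(col, val) -> something(val, missing), missingstring="NA")

# append more rows later; by default no header row is written when appending
df2 = DataFrame(a=[4], b=["w"])
CSV.write("out.csv", df2; append=true, missingstring="NA")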
Creates an iterator that produces csv-formatted strings for each row in the input table.
Supported keyword arguments include:
bufsize::Int=2^22: The length of the buffer to use when writing each csv-formatted row; default 4MB; if a row is larger than the bufsize an error is thrown
delim::Union{Char, String}=',': a character or string to print out as the file's delimiter
quotechar::Char='"': ascii character to use for quoting text fields that may contain delimiters or newlines
openquotechar::Char: instead of quotechar, use openquotechar and closequotechar to support different starting and ending quote characters
escapechar::Char='"': ascii character used to escape quote characters in a text field
missingstring::String="": string to print for missing values
dateformat=Dates.default_format(T): the date format string to use for printing out Date & DateTime columns
header: pass a list of column names (Symbols or Strings) to use instead of the column names of the input table
newline='\n': character or string to use to separate rows (lines in the csv file)
quotestrings=false: whether to force all strings to be quoted or not
decimal='.': character to use as the decimal point when writing floating point numbers
transform=(col,val)->val: a function that is applied to every cell, e.g. we can transform all nothing values to missing using (col, val) -> something(val, missing)
bom=false: whether to write a UTF-8 BOM header (0xEF 0xBB 0xBF) or not
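A rough sketch of iterating CSV.RowWriter with the keyword arguments listed above (the table contents below are purely illustrative):

using CSV, DataFrames

df = DataFrame(x=1:3, y=["a", "b", "c"])

# each iteration yields one csv-formatted String (terminated by `newline`)
for line in CSV.RowWriter(df; delim=';')
    print(line)
end

# or join all produced strings into a single csv text
csvtext = join(CSV.RowWriter(df))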