forked from SquareBracketAssociates/EnterprisePharo
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Move 6 chapters: ReadyToReviews -> EnterprisePharo
NeoCSV NeoJSON RenoirST Voyage WebSockets WebApp
- Loading branch information
1 parent
a8bb638
commit 3c48714
Showing
20 changed files
with
2,801 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,207 @@ | ||
! Handling CSV with NeoCSV | ||
|
||
CSV (Comma Separated Values) is a popular data-interchange format. NeoCSV is an elegant and efficient standalone Pharo framework to read and write CSV converting to or from Pharo objects. This chapter has been written by | ||
Sven Van Caekenberghe the author of NeoCSV and many other nicely designed Pharo libraries. | ||
|
||
!! An introduction to CSV | ||
|
||
CSV is a lightweight text-based de facto standard for human-readable tabular data interchange. Essentially, the key characteristics are that CSV (or more generally, delimiter separated text data): | ||
|
||
- is text-based (ASCII, Latin1, Unicode), | ||
- consists of records, 1 per line (any line ending convention), | ||
- where records consist of fields separated by a delimiter (comma, tab, semicolon), | ||
- where every record has the same number of fields, and | ||
- where fields can be quoted should they contain separators or line endings. | ||
|
||
Here are some relevant links: | ||
|
||
*http://en.wikipedia.org/wiki/Comma-separated_values* | ||
*http://tools.ietf.org/html/rfc4180* | ||
|
||
Note that there is not one single official standard specification. | ||
|
||
!! NeoCSV | ||
|
||
The NeoCSV framework contains a reader (==NeoCSVReader==) and a writer (==NeoCSVWriter==) to parse respectively generate delimiter separated text data to or from Smalltalk objects. The goals of this project are: | ||
|
||
- to be standalone (have no dependencies and have little requirements), | ||
- to be small, elegant and understandable, | ||
- to be efficient (both in time and space), and | ||
- to be flexible and non-intrusive. | ||
|
||
To use either the reader or the writer, you instantiate them on a character stream and use standard stream access messages. Here are two examples: | ||
|
||
The first one read a sequences of data separated by ==,== and containing some line breaks. The reader produces arrays corresponding to the lines with the data. | ||
|
||
[[[ | ||
(NeoCSVReader on: '1,2,3\4,5,6\7,8,9' withCRs readStream) upToEnd. | ||
--> #(#('1' '2' '3') #('4' '5' '6') #('7' '8' '9')) | ||
]]] | ||
|
||
|
||
The second proceeds from the inverse: given a set of data as arrays it produces comma separated lines. | ||
|
||
[[[ | ||
String streamContents: [ :stream | | ||
(NeoCSVWriter on: stream) | ||
nextPutAll: #( (x y z) (10 20 30) (40 50 60) (70 80 90) ) ]. | ||
|
||
--> | ||
'"x","y","z" | ||
"10","20","30" | ||
"40","50","60" | ||
"70","80","90" | ||
' | ||
]]] | ||
|
||
!! Generic Mode | ||
|
||
NeoCSV can operate in generic mode without any further customization. While writing, | ||
|
||
- record objects should respond to the ==do:== protocol, | ||
- fields are always sent ==asString== and quoted, and | ||
- CRLF line ending is used. | ||
|
||
While reading, | ||
|
||
- records become arrays, | ||
- fields remain strings, | ||
- any line ending is accepted, and | ||
- both quoted and unquoted fields are allowed. | ||
|
||
The standard delimiter is a comma character. Quoting is always done using a double quote character. A double quote character inside a field will be escaped by repeating it. Field separators and line endings are allowed inside a quoted field. Any whitespace is significant. | ||
|
||
|
||
!! Customizing NeoCSVWriter | ||
|
||
Any character can be used as field separator, for example: | ||
|
||
[[[ | ||
neoCSVWriter separator: Character tab | ||
]]] | ||
or | ||
[[[ | ||
neoCSVWriter separator: $; | ||
]]] | ||
|
||
|
||
Likewise, any of the three common line end conventions can be set: in the following example we set carriage return. | ||
|
||
[[[ | ||
neoCSVWriter lineEndConvention: #cr | ||
]]] | ||
|
||
There are 3 ways a field can be written (in increasing order of efficiency): | ||
|
||
- quoted - converting it with ==asString== and quoting it (the default), | ||
- raw - converting it with ==asString== but not quoting it, and | ||
- object - not quoting it and using ==printOn:== directly on the output stream. | ||
|
||
Obviously, when disabling quoting, you have to be sure your values do not contain embedded separators or line endings. If you are writing arrays of numbers for example, this would be the fastest way to do it: | ||
|
||
[[[ | ||
neoCSVWriter | ||
fieldWriter: #object; | ||
nextPutAll: #( (100 200 300) (400 500 600) (700 800 900) ) | ||
]]] | ||
|
||
The ==fieldWriter== option applies to all fields. | ||
|
||
If your data is in the form of regular domain level objects it would be wasteful to convert them to arrays just for writing them as CSV. NeoCSV has a non-intrusive option to map your domain object's fields: You add field specifications based on accessors. This is how you would write an array of Points. | ||
|
||
[[[ | ||
neoCSVWriter | ||
nextPut: #(x y); | ||
addFields: #(x y); | ||
nextPutAll: { 1@2. 3@4. 5@6 } | ||
]]] | ||
|
||
Note how we first write the header (before customizing the writer). The ==addField:== and ==addFields:== methods arrange for the specified accessors to be performed on the incoming objects to produce values that will be written by the fieldWriter. Additionally, there is a protocol to specify different field writing behavior per field, using ==addQuotedField:==, ==addRawField:== and ==addObjectField:==. To specify different field writers for an array (actually an SequenceableCollection subclass), you can use the methods first, second, third, ... as accessors. | ||
|
||
SD: What is the difference between nextPut and addFields: | ||
SD: for the comics should I add a method to handle the relationship to other domain objects? | ||
|
||
!! Customizing NeoCSVReader | ||
|
||
The parser is flexible and forgiving. Any line ending will do, quoted and non-quoted fields are allowed. | ||
|
||
Any character can be used as field separator, for example: | ||
|
||
[[[ | ||
neoCSVReader separator: Character tab | ||
]]] | ||
or | ||
|
||
[[[ | ||
neoCSVReader separator: $; | ||
]]] | ||
|
||
NeoCSVReader will produce records that are instances of its ==recordClass==, which defaults to ==Array==. All fields are always read as Strings. If you want, you can specify converters for each field, to convert them to integers or floats, any other object. Here is an example: | ||
|
||
[[[ | ||
neoCSVReader | ||
addIntegerField; | ||
addFloatField; | ||
addField; | ||
addFieldConverter: [ :string | Date fromString: string ]; | ||
upToEnd. | ||
]]] | ||
|
||
Here we specify 4 fields: an integer, a float, a string and a date field. Field conversions specified this way only work on indexable record classes, like Array. | ||
|
||
In many cases you will probably want your data to be returned as one of your domain objects. It would be wasteful to first create arrays and then convert all those. NeoCSV has non-intrusive options to create instances of your own object classes and to convert and set fields on them directly. This is done by specifying accessors and converters. Here is an example for reading Associations of Floats. | ||
|
||
|
||
[[[ | ||
(NeoCSVReader on: '1.5,2.2\4.5,6\7.8,9.1' withCRs readStream) | ||
recordClass: Association; | ||
addFloatField: #key: ; | ||
addFloatField: #value: ; | ||
upToEnd. | ||
]]] | ||
|
||
For each field you give the mutator accessor to use, as well as an implicit or explicit conversion block. | ||
|
||
One thing that it will enforce is that all records have an equal number of fields. When there are no field accessors or converters, the field count will be set automatically after the first record is read. If you want you could set it upfront. When there are field accessors or converters, the field count will be set to the number of specified fields. | ||
|
||
!! Reading a lot of objects | ||
|
||
Handling large CSV file is possible with NeoCVS since it pays attention to have | ||
For example its ==do:== is implemented as streaming over the record one by one, never holding more than one in memory. | ||
Here is a little example generating first over 300 Mb of data and loading it partly. | ||
|
||
[[[ | ||
'paul.csv' asFileReference writeStreamDo: [ :file| | ||
ZnBufferedWriteStream on: file do: [ :out | | ||
(NeoCSVWriter on: out) in: [ :writer | | ||
writer writeHeader: { #Number. #Color. #Integer. #Boolean}. | ||
1 to: 1e7 do: [ :each | | ||
writer nextPut: { each. #(Red Green Blue) atRandom. 1e6 atRandom. #(true false) atRandom } ] ] ] ]. | ||
]]] | ||
|
||
|
||
This results in a 300Mb file: | ||
|
||
[[[ | ||
]]]$ ls -lah paul.csv | ||
-rw-r--r--@ 1 sven staff 327M Nov 14 20:45 paul.csv | ||
$ wc paul.csv | ||
10000001 10000001 342781577 paul.csv | ||
]]] | ||
|
||
|
||
This is a selective read and collect (loads about 10K records): | ||
|
||
[[[ | ||
Array streamContents: [ :out | | ||
'paul.csv' asFileReference readStreamDo: [ :in | | ||
(NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader | | ||
reader skipHeader; addIntegerField; addSymbolField; addIntegerField; addFieldConverter: [ :x | x = #true ]. | ||
reader do: [ :each | each third < 1000 ifTrue: [ out nextPut: each ] ] ] ] ]. | ||
]]] | ||
|
||
|
||
!! Conclusion | ||
|
||
NeoCSV provides a good support to emit or import comma separated data. | ||
|
Oops, something went wrong.