-
Notifications
You must be signed in to change notification settings - Fork 2
/
srfi-13.txt
1903 lines (1508 loc) · 79.6 KB
/
srfi-13.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
The SRFI 13 string libraries -*- outline -*-
Olin Shivers
1999/10/2
Last Update: 2001/12/23
Emacs should display this document in outline mode. Say c-h m for
instructions on how to move through it by sections (e.g., c-c c-n, c-c c-p).
* Table of contents
-------------------
Abstract
Procedure index
Rationale
Strings are code-point sequences
String operations are locale- and context-independent
Internationalisation & super-ASCII character types
Case mapping and case folding
String equality & string normalisation
String inequality
Naming conventions
Shared storage
R4RS/R5RS procedures
Extra-SRFI recommendations
Procedure Specification
Main procedures
Predicates
Constructors
List & string conversion
Selection
Modification
Comparison
Prefixes & suffixes
Searching
Alphabetic case mapping
Reverse & append
Fold, unfold & map
Replicate & rotate
Miscellaneous: insertion, parsing
Filtering & deleting
Low-level procedures
START/END optional argument parsing & checking utilities
Knuth-Morris-Pratt searching
Reference implementation
Acknowledgements
References & links
Copyright
-------------------------------------------------------------------------------
* Abstract
----------
R5RS Scheme has an impoverished set of string-processing utilities, which is
a problem for authors of portable code. This SRFI proposes a coherent and
comprehensive set of string-processing procedures; it is accompanied by a
reference implementation of the spec. The reference implementation is
- portable
- efficient
- open source
The routines in this SRFI are backwards-compatible with the string-processing
routines of R5RS.
-------------------------------------------------------------------------------
* Procedure index
-----------------
Here is a list of the procedures provided by the string-lib
and string-lib-internals packages. "#" marks R5RS procedures; "+"
marks extended R5RS procedures.
Predicates
# string?
string-null?
string-every string-any
Constructors
# make-string
# string
string-tabulate
List & string conversion
+ string->list
# list->string
reverse-list->string
string-join
Selection
# string-length
# string-ref
+ string-copy
substring/shared
string-copy!
string-take string-take-right
string-drop string-drop-right
string-pad string-pad-right
string-trim string-trim-right string-trim-both
Modification
# string-set!
+ string-fill!
Comparison
string-compare string-compare-ci
string<> string= string< string> string<= string>=
string-ci<> string-ci= string-ci< string-ci> string-ci<= string-ci>=
string-hash string-hash-ci
Prefixes & suffixes
string-prefix-length string-suffix-length
string-prefix-length-ci string-suffix-length-ci
string-prefix? string-suffix?
string-prefix-ci? string-suffix-ci?
Searching
string-index string-index-right
string-skip string-skip-right
string-count
string-contains string-contains-ci
Alphabetic case mapping
string-titlecase string-upcase string-downcase
string-titlecase! string-upcase! string-downcase!
Reverse & append
string-reverse string-reverse!
# string-append
string-concatenate
string-concatenate/shared string-append/shared
string-concatenate-reverse string-concatenate-reverse/shared
Fold, unfold & map
string-map string-map!
string-fold string-fold-right
string-unfold string-unfold-right
string-for-each string-for-each-index
Replicate & rotate
xsubstring string-xcopy!
Miscellaneous: insertion, parsing
string-replace string-tokenize
Filtering & deleting
string-filter string-delete
Low-level procedures
string-parse-start+end
string-parse-final-start+end
let-string-start+end
check-substring-spec
substring-spec-ok?
make-kmp-restart-vector kmp-step string-kmp-partial-search
-------------------------------------------------------------------------------
* Rationale
-----------
This SRFI defines two libraries that provide a rich set of operations for
manipulating strings. These are frequently useful for scripting and other
text-manipulation applications. The library's design was influenced by the
string libraries found in MIT Scheme, Gambit, RScheme, MzScheme, slib, Common
Lisp, Bigloo, guile, Chez, APL, Java, and the SML standard basis.
All procedures involving character comparison are available in
both case-sensitive and case-insensitive forms.
All functionality is available in substring and full-string forms.
** Strings are code-point sequences
===================================
This SRFI considers strings simply to be a sequence of "code points" or
character encodings. Operations such as comparison or reversal are always done
code point by code point. See the comments below on super-ASCII character
types for implications that follow.
It's entirely possible that a legal string might not be a sensible "text"
sequence. For example, consider a string comprised entirely of zero-width
Unicode accent characters with no preceding base character to modify --
this is a legal string, albeit one that does not make a great deal of sense
when interpreted as a sequence of natural-language text. The routines in
this SRFI do not handle these "text" concerns; they restrict themselves
to the underlying view of strings as merely a sequence of "code points."
** String operations are locale- and context-independent
========================================================
This SRFI defines string operations that are locale- and context-independent.
While it is certainly important to have a locale-sensitive comparison or
collation procedure when processing text, it is also important to have a suite
of operations that are reliably invariant for basic string processing ---
otherwise, a change of locale could cause data structures such as hash tables,
b-trees, symbol tables, directories of filenames, etc. to become corrupted.
Locale- and context-sensitive text operations, such as collation, are
explicitly deferred to a subsequent, companion "text" SRFI.
** Internationalisation & super-ASCII character types
=====================================================
The major issue confronting this SRFI is the existence of super-ASCII
character encodings, such as eight-bit Latin-1 or 16- and 32-bit Unicode. It
is a design goal of this SRFI for the API to be portable across string
implementations based on at least these three standard encodings.
Unfortunately, this places strong limitations on the API design. Here are
some relevant issues. Be warned that life in a super-ASCII world is
significantly more complex; there are no easy answers for many of these issues.
*** Case mapping and case-folding
=================================
Upper- and lower-casing characters is complex in super-ASCII encodings.
- Some characters case-map to more than one character. For example,
the Latin-1 German eszet character upper-cases to "SS."
+ This means that the R5RS function CHAR-UPCASE is not well-defined,
since it is defined to produce a (single) character result.
+ It means that an in-place STRING-UPCASE! procedure cannot be reliably
defined, since the original string may not be long enough to contain
the result -- an N-character string might upcase to a 2N-character result.
+ It means that case-insensitive string-matching or searching is quite
tricky. For example, an n-character string S might match a 2N-character
string S'.
- Some characters case-map in different ways depending upon their surrounding
context. For example, the Unicode Greek capital sigma character downcases
differently depending upon whether or not it is the final character in a
word. Again, this spells trouble for the simple R5RS CHAR-DOWNCASE function.
- Unicode defines three cases: lowercase, uppercase and titlecase. The
distinction between uppercase and titlecase arises in the presence of
Unicode's compound characters. For example, Unicode has a single character
representing the compound pair "dz." Uppercasing the "dz" character produces
the compound character "DZ", while titlecasing (or, as Americans say,
capitalizing) it produces compound character "Dz".
- Turkish actually has different case-mappings from other languages.
The Unicode Consortium's web site
http://www.unicode.org/
has detailed discussions of the issues. See in particular technical report
21 on case mappings
http://www.unicode.org/unicode/reports/tr21/
SRFI 13 makes no attempt to deal with these issues; it uses a simple 1-1
locale- and context-independent case-mapping, specifically Unicode's 1-1
case-mappings given in
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
The format of this file is explained in
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html
Note that this means that German eszet upper-cases to itself, not "SS".
Case-mapping and case-folding operations in SRFI 13 are locale-independent so
that shifting locales won't wreck hash tables, b-trees, symbol tables, etc.
*** String equality & string normalisation
==========================================
Comparing strings for equality is complicated because in some cases Unicode
actually provides multiple encodings for the "same" character, and because
what we usually think of as a "character" can be represented in Unicode as a
*sequence* of several code-points. For example, consider the letter "e" with
an acute accent. There is a single Unicode character for this. However,
Unicode also allows one to represent this with a two-character sequence: the
"e" character followed by a zero-width acute-accent character. As another
example, Unicode provides some Asian characters in "narrow" and "full" widths.
There are multiple ways we might want to compare strings for equality. In
(roughly) decreasing order of precision,
- we might want a precise comparison of the actual encoding, so that
<e-acute> would *not* compare equal to <e, acute>.
- We might want a "normalised" comparison, where these two sequences would
compare equal.
- We might want an even more-permissive normalisation, where visually-distinct
properties of "the same" character would be ignored. For example, we might
want narrow/full-width versions of the same Asian character to compare equal.
- We might want comparisons that are insensitive to accents and diacritical
marks.
- We might want comparisons that are case-insensitive.
- We might want comparisons that are insensitive to several of the above
properties.
- We might want ways to "normalise" strings into various canonical forms.
This library does not address these complexities. SRFI 13 string equality is
simply based upon comparing the encoding values used for the characters.
Accent-insensitive and other types of comparison are not provided; only
a simple form of case-insensitive comparison is provided, which uses the
1-1 case mappings specified by Unicode in
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
These are adequate for "program" or "systems" use of strings (e.g., to
manipulate program identifiers and operating-system filenames).
*** String inequality
=====================
Above and beyond the issues arising in string-equality, when we attempt
to order strings there are even further considerations.
- French orders accents with right-to-left significance -- the reverse of
the significance of the characters.
- Case-insensitive ordering is not well defined by simple "code-point"
considerations, even for simple ASCII: there are punctuation characters
between the ASCII's upper-case range of letters and its lower-case range
(left-bracket, backslash, right-bracket, caret, underbar and backquote).
Does left-bracket compare less-than or greater-than "a" in a
case-insensitive comparison?
- The German eszet character should sort as if it were the *pair* of
letters "ss".
Unicode defines a complex set of machinery for ordering or "collating"
strings, which involves mapping each string to a multi-byte sort key,
and then doing simple lexicographic sorting with these keys. These rules
can be overlaid by additional domain- or language-specific rules. Again,
this SRFI does not address these issues. SRFI 13 string ordering is strictly
based upon a character-by-character comparison of the values used for
representing the string.
** Naming conventions
=====================
This library contains a large number of procedures, but they follow
a consistent naming scheme, and are consistent with the conventions
developed in SRFI 1. The names are composed of smaller lexemes
in a regular way that exposes the structure and relationships between the
procedures. This should help the programmer to recall or reconstitute the name
of the particular procedure that he needs when writing his own code. In
particular
- Procedures whose names end in "-ci" are case-insensitive variants.
- Procedures whose names end in "!" are side-effecting variants.
What values these procedures return is usually not specified.
- The order of common parameters is consistent across the
different procedures.
- Left/right/both directionality
Procedures that have left/right directional variants
use the following convention:
Direction Naming convention
------------- -----------------
left-to-right no suffix
right-to-left -RIGHT suffix
both -BOTH suffix
This is a general convention that was established in SRFI 1.
The value of a convention is proportional to the extent of its
use.
** Shared storage
=================
Some Scheme implementations, e.g. guile and T, provide ways to construct
substrings that share storage with other strings. This facility is called
"shared-text substrings." Shared-text substrings can be used to eliminate the
allocation and copying time and space required to produce substrings, which
can be a tremendous savings for some applications, reducing a linear-time
operation to constant time. Additionally, some algorithms rely on the sharing
property of these substrings -- the application assumes that if the underlying
storage is mutated, then all strings sharing that storage will show the
change. However, shared-text substrings are not a common feature; most Scheme
implementations do not provide them.
SRFI 13 takes a middle ground with respect to shared-text substrings. In
particular, a Scheme implementation does not need to have shared-text
substrings in order to implement this SRFI.
There is an additional form of storage sharing enabled by some SRFI 13
procedures, even without the benefit of shared-text substrings. In
some cases, some SRFI 13 routines are allowed to return as a result one
of the strings that was passed in as a parameter. For example, when
constructing a substring with the SUBSTRING/SHARED procedure, if the
requested substring is the entire string, the procedure is permitted
simply to return the original value. That is,
(eq? s (substring/shared s 0 (string-length s))) => true or false
whereas the R5RS SUBSTRING function is required to allocate a fresh copy
(eq? s (substring s 0 (string-length s))) => false.
In keeping with SRFI 13's general approach to sharing, compliant
implementations are allowed, but not required, to provide this kind of
sharing. Hence, procedures may not *rely* upon sharing in these cases.
Most procedures that permit results to share storage with inputs have
equivalent procedures that require allocating fresh storage for results.
If an application wishes to be sure a new, fresh string is allocated, then
these "pure" procedures should be used.
Fresh copy guaranteed Sharing permitted
--------------------- -----------------
string-copy substring/shared
string-copy string-take string-take-right
string-copy string-drop string-drop-right
string-concatenate string-concatenate/shared
string-append string-append/shared
string-concatenate-reverse string-concatenate-reverse/shared
string-pad string-pad-right
string-trim string-trim-right string-trim-both
string-filter string-delete
On the other hand, the functionality is present to allow one to write
efficient code *without* shared-text substrings. You can write efficient code
that works by passing around start/end ranges indexing into a string instead
of simply building a shared-text substring. The API would be much simpler
without this consideration -- if we had cheap shared-text substrings, all the
start/end index parameters would vanish. However, since SRFI 13 does not
require implementations to provide shared-text substrings, the extended
API is provided.
** R4RS/R5RS procedures
=======================
The R4RS and R5RS reports define 22 string procedures. The string-lib
package includes 8 of these exactly as defined, 3 in an extended,
backwards-compatible way, and drops the remaining 11 (whose functionality
is available via other bindings).
The 8 procedures provided exactly as documented in the reports are
string?
make-string
string
string-length
string-ref
string-set!
string-append
list->string
The eleven functions not included are
string=? string-ci=?
string<? string-ci<?
string>? string-ci>?
string<=? string-ci<=?
string>=? string-ci>=?
substring
The string-lib package provides alternate bindings and extended functionality.
Additionally, the three extended procedures are
string-fill! s char [start end] -> unspecified
string->list s [start end] -> char-list
string-copy s [start end] -> string
They are uniformly extended to take optional start/end parameters specifying
substring ranges.
** Extra-SRFI recommendations
=============================
This SRFI recommends the following
- A SRFI be defined for shared-text substrings, allowing programs to
be written that actually rely on the shared-storage properties of these data
structures.
- A SRFI be defined for manipulating Unicode text -- various normalisation
operations, collation, searching, etc. Collation operations might be
parameterised by a "collation" structure representing collation rules
for a particular locale or language. Alternatively, a data structure
specifying collation rules could be activated with dynamic scope by
special procedures, possibly overridden by allowing collation rules
to be optional arguments to procedures that need to order strings, e.g.
(with-locale* denmark-locale
(lambda ()
(f x)
(g 42)))
(with-locale taiwan-locale
(f x)
(h denmark-locale)
(g 42))
(set-locale! denmark-locale)
- A SRFI be defined for manipulating characters that is portable across
at least ASCII, Latin-1 and Unicode.
- For backwards-compatibility, CHAR-UPCASE and CHAR-DOWNCASE should
be defined to use the 1-1 locale- and context-insensitive case
mappings given by Unicode's UnicodeData.txt table.
- numeric codes for standard functions that map between characters and
integers should be required to use the Unicode/Latin-1/ASCII mapping. This
allows programmers to write portable code.
- CHAR-TITLECASE be added to CHAR-UPCASE and CHAR-DOWNCASE
- CHAR-TITLECASE? be added to CHAR-UPCASE? and CHAR-DOWNCASE?
- Title/up/down-case functions be added to the character-processing suite
which allow 1->n case maps by returning immutable,
possibly-multi-character strings instead of single characters. These case
mappings need not be locale- or context-sensitive.
These recommendations are not a part of the SRFI 13 spec. Note also that
requiring a Unicode/Latin-1/ASCII interface to integer/char mapping
functions does not imply anything about the actual underlying encodings of
characters.
-------------------------------------------------------------------------------
* Procedure Specification
-------------------------
In the following procedure specifications:
- An S parameter is a string.
- A CHAR parameter is a character.
- START and END parameters are half-open string indices specifying
a substring within a string parameter; when optional, they default
to 0 and the length of the string, respectively. When specified, it
must be the case that 0 <= START <= END <= (string-length S), for
the corresponding parameter S. They typically restrict a procedure's
action to the indicated substring.
- A PRED parameter is a unary character predicate procedure, returning
a true/false value when applied to a character.
- A CHAR/CHAR-SET/PRED parameter is a value used to select/search
for a character in a string. If it is a character, it is used in
an equality test; if it is a character set, it is used as a
membership test; if it is a procedure, it is applied to the
characters as a test predicate.
- An I parameter is an exact non-negative integer specifying an index
into a string.
- LEN and NCHARS parameters are exact non-negative integers specifying a
length of a string or some number of characters.
- An OBJ parameter may be any value at all.
Passing values to procedures with these parameters that do not satisfy these
types is an error.
If a procedure is said to return "unspecified," this means that nothing at all
is said about what the procedure returns, not even the number of return
values. Such a procedure is not even required to be consistent from call to
call in the nature or number of its return values. It is specifically permitted
for such a procedure to return zero values, e.g., by calling (VALUES).
Parameters given in square brackets are optional. Unless otherwise noted in
the text describing the procedure, any prefix of these optional parameters may
be supplied, from zero arguments to the full list. When a procedure returns
multiple values, this is shown by listing the return values in square
brackets, as well. So, for example, the procedure with signature
halts? f [x init-store] -> [boolean integer]
would take one (F), two (F, X) or three (F, X, INPUT-STORE) input parameters,
and return two values, a boolean and an integer.
A parameter followed by "..." means zero-or-more elements. So the procedure
with the signature
sum-squares x ... -> number
takes zero or more arguments (X ...), while the procedure with signature
spell-check doc dict1 dict2 ... -> string-list
takes two required parameters (DOC and DICT1) and zero or more
optional parameters (DICT2 ...).
If a procedure is said to return "unspecified," this means that nothing at all
is said about what the procedure returns. Such a procedure is not even
required to be consistent from call to call. It is simply required to return a
value (or values) that may be passed to a command continuation, e.g. as the
value of an expression appearing as a non-terminal subform of a BEGIN
expression. Note that in R5RS, this restricts such a procedure to returning a
single value; non-R5RS systems may not even provide this restriction.
** Main procedures
==================
In a Scheme system that has a module or package system, these procedures
should be contained in a module named "string-lib".
*** Predicates
==============
string? obj -> boolean R5RS
Returns #t if OBJ is a string, otherwise returns #f.
string-null? s -> boolean
Is S the empty string?
string-every char/char-set/pred s [start end] -> value
string-any char/char-set/pred s [start end] -> value
Checks to see if the given criteria is true of every / any character in S,
proceeding from left (index START) to right (index END).
If CHAR/CHAR-SET/PRED is a character, it is tested for equality with
the elements of S.
If CHAR/CHAR-SET/PRED is a character set, the elements of S are tested
for membership in the set.
If CHAR/CHAR-SET/PRED is a predicate procedure, it is applied to the
elements of S. The predicate is "witness-generating:"
- If STRING-ANY returns true, the returned true value is the one produced
by the application of the predicate.
- If STRING-EVERY returns true, the returned true value is the one
produced by the final application of the predicate to S[END].
If STRING-EVERY is applied to an empty sequence of characters,
it simply returns #T.
If STRING-EVERY or STRING-ANY apply the predicate to the final element
of the selected sequence (i.e., S[END-1]), that final application is a
tail call.
The names of these procedures do not end with a question mark -- this is to
indicate that, in the predicate case, they do not return a simple boolean
(#T or #F), but a general value.
*** Constructors
================
make-string len [char] -> string R5RS
MAKE-STRING returns a newly allocated string of length LEN. If
CHAR is given, then all elements of the string are initialized
to CHAR, otherwise the contents of the string are unspecified.
string char1 ... -> string R5RS
Returns a newly allocated string composed of the argument characters.
string-tabulate proc len -> string
PROC is an integer->char procedure. Construct a string of size LEN
by applying PROC to each index to produce the corresponding string
element. The order in which PROC is applied to the indices is not
specified.
*** List & string conversion
============================
string->list s [start end] -> char-list R5RS+
list->string char-list -> string R5RS
STRING->LIST returns a newly allocated list of the characters
that make up the given string. LIST->STRING returns a newly
allocated string formed from the characters in the list CHAR-LIST,
which must be a list of characters. STRING->LIST and LIST->STRING
are inverses so far as EQUAL? is concerned.
STRING->LIST is extended from the R5RS definition to take optional
START/END arguments.
reverse-list->string char-list -> string
An efficient implementation of (compose list->string reverse):
(reverse-list->string '(#\a #\B #\c)) -> "cBa"
This is a common idiom in the epilog of string-processing loops
that accumulate an answer in a reverse-order list. (See also
STRING-CONCATENATE-REVERSE for the "chunked" variant.)
string-join string-list [delimiter grammar] -> string
This procedure is a simple unparser --- it pastes strings together using
the delimiter string.
The GRAMMAR argument is a symbol that determines how the delimiter is
used, and defaults to 'infix.
- 'infix means an infix or separator grammar: insert the delimiter
between list elements. An empty list will produce an empty string --
note, however, that parsing an empty string with an infix or separator
grammar is ambiguous. Is it an empty list, or a list of one element,
the empty string?
- 'strict-infix means the same as 'infix, but will raise an error
if given an empty list.
- 'suffix means a suffix or terminator grammar: insert the delimiter
after every list element. This grammar has no ambiguities.
- 'prefix means a prefix grammar: insert the delimiter
before every list element. This grammar has no ambiguities.
The delimiter is the string used to delimit elements; it defaults to
a single space " ".
(string-join '("foo" "bar" "baz") ":") => "foo:bar:baz"
(string-join '("foo" "bar" "baz") ":" 'suffix) => "foo:bar:baz:"
;; Infix grammar is ambiguous wrt empty list vs. empty string,
(string-join '() ":") => ""
(string-join '("") ":") => ""
;; but suffix & prefix grammars are not.
(string-join '() ":" 'suffix) => ""
(string-join '("") ":" 'suffix) => ":"
*** Selection
=============
string-length s -> integer R5RS
Returns the number of characters in the string S.
string-ref s i -> char R5RS
Return character S[I] using zero-origin indexing.
I must be a valid index of S.
string-copy s [start end] -> string R5RS+
substring/shared s start [end] -> string
SUBSTRING/SHARED returns a string whose contents are the characters of S
beginning with index START (inclusive) and ending with index END
(exclusive). It differs from the R5RS SUBSTRING in two ways:
- The END parameter is optional, not required.
- SUBSTRING/SHARED may return a value that shares memory with S or
is EQ? to S.
STRING-COPY is extended from its R5RS definition by the addition of
its optional START/END parameters. In contrast to SUBSTRING/SHARED,
it is guaranteed to produce a freshly-allocated string.
Use STRING-COPY when you want to indicate explicitly in your code that you
wish to allocate new storage; use SUBSTRING/SHARED when you don't care if
you get a fresh copy or share storage with the original string.
(string-copy "Beta substitution") => "Beta substitution"
(string-copy "Beta substitution" 1 10)
=> "eta subst"
(string-copy "Beta substitution" 5) => "substitution"
string-copy! target tstart s [start end] -> unspecified
Copy the sequence of characters from index range [START,END) in
string S to string TARGET, beginning at index TSTART. The characters
are copied left-to-right or right-to-left as needed -- the copy is
guaranteed to work, even if TARGET and S are the same string.
It is an error if the copy operation runs off the end of the target
string, e.g.
(string-copy! (string-copy "Microsoft") 0
"Regional Microsoft Operating Companies") => error
string-take s nchars -> string
string-drop s nchars -> string
string-take-right s nchars -> string
string-drop-right s nchars -> string
STRING-TAKE returns the first NCHARS of STRING;
STRING-DROP returns all but the first NCHARS of STRING.
STRING-TAKE-RIGHT returns the last NCHARS of STRING;
STRING-DROP-RIGHT returns all but the last NCHARS of STRING.
If these procedures produce the entire string, they may return either
S or a copy of S; in some implementations, proper substrings may share
memory with S.
(string-take "Pete Szilagyi" 6) => "Pete S"
(string-drop "Pete Szilagyi" 6) => "zilagyi"
(string-take-right "Beta rules" 5) => "rules"
(string-drop-right "Beta rules" 5) => "Beta "
It is an error to take or drop more characters than are in the string:
(string-take "foo" 37) => *error*
string-pad s len [char start end] -> string
string-pad-right s len [char start end] -> string
Build a string of length LEN comprised of S padded on the left (right)
by as many occurrences of the character CHAR as needed. If S has more
than LEN chars, it is truncated on the left (right) to length LEN. CHAR
defaults to #\space.
If LEN <= END-START, the returned value is allowed to share storage
with S, or be exactly S (if LEN = END-START).
(string-pad "325" 5) => " 325"
(string-pad "71325" 5) => "71325"
(string-pad "8871325" 5) => "71325"
string-trim s [char/char-set/pred start end] -> string
string-trim-right s [char/char-set/pred start end] -> string
string-trim-both s [char/char-set/pred start end] -> string
Trim S by skipping over all characters on the left / on the right /
on both sides that satisfy the second parameter CHAR/CHAR-SET/PRED:
- if it is a character CHAR, characters equal to CHAR are trimmed;
- if it is a char set CS, characters contained in CS are trimmed;
- if it is a predicate PRED, it is a test predicate that is applied
to the characters in S; a character causing it to return true
is skipped.
CHAR/CHAR/SET-PRED defaults to the character set CHAR-SET:WHITESPACE
defined in SRFI 14.
If no trimming occurs, these functions may return either S or a copy of S;
in some implementations, proper substrings may share memory with S.
(string-trim-both " The outlook wasn't brilliant, \n\r")
=> "The outlook wasn't brilliant,"
*** Modification
================
string-set! s i char -> unspecified R5RS
I must be a valid index of S. STRING-SET! stores CHAR in
element I of S. Constant string literals appearing in code are
immutable; it is an error to use them in a STRING-SET!.
(define (f) (make-string 3 #\*))
(define (g) "***")
(string-set! (f) 0 #\?) ==> *unspecified*
(string-set! (g) 0 #\?) ==> *error*
(string-set! (symbol->string 'immutable)
3
#\?) ==> *error*
string-fill! s char [start end] -> unspecified R5RS+
Stores CHAR in every element of S.
STRING-FILL is extended from the R5RS definition to take optional
START/END arguments.
*** Comparison
==============
string-compare s1 s2 proc< proc= proc> [start1 end1 start2 end2] -> values
string-compare-ci s1 s2 proc< proc= proc> [start1 end1 start2 end2] -> values
Apply PROC<, PROC=, or PROC> to the mismatch index, depending
upon whether S1 is less than, equal to, or greater than S2.
The "mismatch index" is the largest index i such that for
every 0 <= j < i, s1[j] = s2[j] -- that is, i is the first
position that doesn't match.
STRING-COMPARE-CI is the case-insensitive variant. Case-insensitive
comparison is done by case-folding characters with the operation
(char-downcase (char-upcase c))
where the two case-mapping operations are assumed to be 1-1, locale- and
context-insensitive, and compatible with the 1-1 case mappings specified
by Unicode's UnicodeData.txt table:
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
The optional start/end indices restrict the comparison to the indicated
substrings of S1 and S2. The mismatch index is always an index into S1;
in the case of PROC=, it is always END1; we observe the protocol
in this redundant case for uniformity.
(string-compare "The cat in the hat" "abcdefgh"
values values values
4 6 ; Select "ca"
2 4) ; & "cd"
=> 5 ; Index of S1's "a"
Comparison is simply done on individual code-points of the string.
True text collation is not handled by this SRFI.
string= s1 s2 [start1 end1 start2 end2] -> boolean
string<> s1 s2 [start1 end1 start2 end2] -> boolean
string< s1 s2 [start1 end1 start2 end2] -> boolean
string> s1 s2 [start1 end1 start2 end2] -> boolean
string<= s1 s2 [start1 end1 start2 end2] -> boolean
string>= s1 s2 [start1 end1 start2 end2] -> boolean
These procedures are the lexicographic extensions to strings of the
corresponding orderings on characters. For example, STRING< is the
lexicographic ordering on strings induced by the ordering CHAR<? on
characters. If two strings differ in length but are the same up to
the length of the shorter string, the shorter string is considered to
be lexicographically less than the longer string.
The optional start/end indices restrict the comparison to the indicated
substrings of S1 and S2.
Comparison is simply done on individual code-points of the string.
True text collation is not handled by this SRFI.
string-ci= s1 s2 [start1 end1 start2 end2] -> boolean
string-ci<> s1 s2 [start1 end1 start2 end2] -> boolean
string-ci< s1 s2 [start1 end1 start2 end2] -> boolean
string-ci> s1 s2 [start1 end1 start2 end2] -> boolean
string-ci<= s1 s2 [start1 end1 start2 end2] -> boolean
string-ci>= s1 s2 [start1 end1 start2 end2] -> boolean
Case-insensitive variants.
Case-insensitive comparison is done by case-folding characters with
the operation
(char-downcase (char-upcase c))
where the two case-mapping operations are assumed to be 1-1, locale- and
context-insensitive, and compatible with the 1-1 case mappings specified
by Unicode's UnicodeData.txt table:
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
string-hash s [bound start end] -> integer
string-hash-ci s [bound start end] -> integer
Compute a hash value for the string S. BOUND is a non-negative
exact integer specifying the range of the hash function. A positive
value restricts the return value to the range [0,BOUND).
If BOUND is either zero or not given, the implementation may use an
implementation-specific default value, chosen to be as large as
is efficiently practical. For instance, the default range might be chosen
for a given implementation to map all strings into the range of
integers that can be represented with a single machine word.
The optional start/end indices restrict the hash operation to the
indicated substring of S.
STRING-HASH-CI is the case-insensitive variant. Case-insensitive
comparison is done by case-folding characters with the operation
(char-downcase (char-upcase c))
where the two case-mapping operations are assumed to be 1-1, locale- and
context-insensitive, and compatible with the 1-1 case mappings specified
by Unicode's UnicodeData.txt table:
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
Invariants:
(<= 0 (string-hash s b) (- b 1)) ; When B > 0.
(string= s1 s2) => (= (string-hash s1 b) (string-hash s2 b))
(string-ci= s1 s2) => (= (string-hash-ci s1 b) (string-hash-ci s2 b))
A legal but nonetheless discouraged implementation:
(define (string-hash s . other-args) 1)
(define (string-hash-ci s . other-args) 1)
Rationale: allowing the user to specify an explicit bound simplifies user
code by removing the mod operation that typically accompanies every hash
computation, and also may allow the implementation of the hash function to
exploit a reduced range to efficiently compute the hash value. E.g., for
small bounds, the hash function may be computed in a fashion such that
intermediate values never overflow into bignum integers, allowing the
implementor to provide a fixnum-specific "fast path" for computing the
common cases very rapidly.
*** Prefixes & suffixes
=======================
string-prefix-length s1 s2 [start1 end1 start2 end2] -> integer
string-suffix-length s1 s2 [start1 end1 start2 end2] -> integer
string-prefix-length-ci s1 s2 [start1 end1 start2 end2] -> integer
string-suffix-length-ci s1 s2 [start1 end1 start2 end2] -> integer
Return the length of the longest common prefix/suffix of the two strings.
For prefixes, this is equivalent to the "mismatch index" for the strings
(modulo the STARTi index offsets).
The optional start/end indices restrict the comparison to the indicated
substrings of S1 and S2.
STRING-PREFIX-LENGTH-CI and STRING-SUFFIX-LENGTH-CI are the
case-insensitive variants. Case-insensitive comparison is done by
case-folding characters with the operation
(char-downcase (char-upcase c))
where the two case-mapping operations are assumed to be 1-1, locale- and
context-insensitive, and compatible with the 1-1 case mappings specified
by Unicode's UnicodeData.txt table:
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
Comparison is simply done on individual code-points of the string.
string-prefix? s1 s2 [start1 end1 start2 end2] -> boolean
string-suffix? s1 s2 [start1 end1 start2 end2] -> boolean
string-prefix-ci? s1 s2 [start1 end1 start2 end2] -> boolean
string-suffix-ci? s1 s2 [start1 end1 start2 end2] -> boolean
Is S1 a prefix/suffix of S2?
The optional start/end indices restrict the comparison to the indicated
substrings of S1 and S2.
STRING-PREFIX-CI? and STRING-SUFFIX-CI? are the case-insensitive variants.
Case-insensitive comparison is done by case-folding characters with the
operation
(char-downcase (char-upcase c))
where the two case-mapping operations are assumed to be 1-1, locale- and
context-insensitive, and compatible with the 1-1 case mappings specified
by Unicode's UnicodeData.txt table:
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
Comparison is simply done on individual code-points of the string.
*** Searching
=============
string-index s char/char-set/pred [start end] -> integer or #f
string-index-right s char/char-set/pred [start end] -> integer or #f
string-skip s char/char-set/pred [start end] -> integer or #f
string-skip-right s char/char-set/pred [start end] -> integer or #f
STRING-INDEX (STRING-INDEX-RIGHT) searches through the string from the
left (right), returning the index of the first occurrence of a character
which
- equals CHAR/CHAR-SET/PRED (if it is a character);
- is in CHAR/CHAR-SET/PRED (if it is a character set);
- satisfies the predicate CHAR/CHAR-SET/PRED (if it is a procedure).
If no match is found, the functions return false.
The START and END parameters specify the beginning and end indices of
the search; the search includes the start index, but not the end index.
Be careful of "fencepost" considerations: when searching right-to-left,
the first index considered is