generated from gonzalezulises/Curso-de-Ciencia-de-Datos
20_regex_reference.py
'''
REFERENCE GUIDE: Regular Expressions
'''
'''
Rules for Searching:
Search proceeds through string from start to end, stopping at first match
All of the pattern must be matched
Basic Patterns:
Ordinary characters match themselves exactly
. matches any single character except newline \n
\w matches a word character (letter, digit, underscore)
\W matches any non-word character
\b matches boundary between word and non-word
\s matches single whitespace character (space, newline, return, tab, form feed, vertical tab)
\S matches single non-whitespace character
\d matches single digit (0 through 9)
\t matches tab
\n matches newline
\r matches return
\ escapes a special character so that it matches literally, such as period: \.
Basic Python Usage:
match = re.search(r'pattern', string_to_search)
Returns match object
If there is a match, access match using match.group()
If there is no match, match is None
Use 'r' in front of pattern to designate a raw string
'''
import re
s = 'my 1st string!!'
match = re.search(r'my', s) # returns match object
if match: # checks whether match was found
    print(match.group()) # if match was found, then print result
re.search(r'my', s).group() # single-line version (without error handling)
re.search(r'st', s).group() # 'st'
re.search(r'sta', s).group() # error (AttributeError: no match, so search() returned None)
re.search(r'\w\w\w', s).group() # '1st'
re.search(r'\W', s).group() # ' '
re.search(r'\W\W', s).group() # '!!'
re.search(r'\s', s).group() # ' '
re.search(r'\s\s', s).group() # error (no match)
re.search(r'..t', s).group() # '1st'
re.search(r'\s\St', s).group() # ' st'
re.search(r'\bst', s).group() # 'st'
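'''
Handling a Missing Match (sketch):
Calling .group() directly on a failed search raises AttributeError, because re.search() returned None
One way to avoid this is to wrap the check in a small helper
first_match() is a hypothetical name used only for this illustration; it is not part of the re module
'''
```python
import re

def first_match(pattern, text, default=None):
    """Return the text of the first match, or `default` if nothing matches."""
    match = re.search(pattern, text)
    return match.group() if match else default

s = 'my 1st string!!'
first_match(r'st', s)               # 'st'
first_match(r'sta', s)              # None (no exception raised)
first_match(r'sta', s, default='')  # ''
```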
'''
Repetition:
+ 1 or more occurrences of the pattern to its left
* 0 or more occurrences of the pattern to its left
? 0 or 1 occurrence of the pattern to its left
+ and * are 'greedy': they try to use up as much of the string as possible
Add ? after + or * to make them 'lazy': +? or *?
'''
s = 'sid is missing class'
re.search(r'miss\w+', s).group() # 'missing'
re.search(r'is\w+', s).group() # 'issing'
re.search(r'is\w*', s).group() # 'is'
s = '<h1>my heading</h1>'
re.search(r'<.+>', s).group() # '<h1>my heading</h1>'
re.search(r'<.+?>', s).group() # '<h1>'
'''
Positions:
^ match start of a string
$ match end of a string
'''
s = 'sid is missing class'
re.search(r'^miss', s).group() # error (no match: string does not start with 'miss')
re.search(r'..ss', s).group() # 'miss'
re.search(r'..ss$', s).group() # 'lass'
'''
Brackets:
[abc] match a or b or c
\w, \s, etc. work inside brackets, but a period inside brackets is just a literal period
[a-z] match any lowercase letter (dash indicates range unless it's first or last)
[abc-] match a or b or c or -
[^ab] match anything except a or b
'''
s = 'my email is john-doe@gmail.com'
re.search(r'\w+@\w+', s).group() # 'doe@gmail'
re.search(r'[\w.-]+@[\w.-]+', s).group() # 'john-doe@gmail.com'
'''
Lookarounds:
Lookahead matches a pattern only if it is followed by another pattern
100(?= dollars) matches '100' only if it is followed by ' dollars'
Lookbehind matches a pattern only if it is preceded by another pattern
(?<=\$)100 matches '100' only if it is preceded by '$'
'''
s = 'Name: Cindy, 30 years old'
re.search(r'\d+(?= years? old)', s).group() # '30'
re.search(r'(?<=Name: )\w+', s).group() # 'Cindy'
'''
Match Groups:
Parentheses create logical groups inside of match text
match.group(1) corresponds to first group
match.group(2) corresponds to second group
match.group() corresponds to entire match text (as usual)
'''
s = 'my email is john-doe@gmail.com'
match = re.search(r'([\w.-]+)@([\w.-]+)', s)
if match:
    match.group(1) # 'john-doe'
    match.group(2) # 'gmail.com'
    match.group() # 'john-doe@gmail.com'
'''
Finding All Matches:
re.findall() finds all matches and returns them as a list of strings
list_of_strings = re.findall(r'pattern', string_to_search)
If pattern includes two or more groups, a list of tuples is returned (with exactly one group, a list of that group's text)
'''
s = 'emails: joe@gmail.com, bob@gmail.com'
re.findall(r'[\w.-]+@[\w.-]+', s) # ['joe@gmail.com', 'bob@gmail.com']
re.findall(r'([\w.-]+)@([\w.-]+)', s) # [('joe', 'gmail.com'), ('bob', 'gmail.com')]
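'''
Return Shape of re.findall() (sketch):
The result depends on how many groups the pattern contains
The sample string reuses the addresses from the example above; the variable names are illustrative
'''
```python
import re

s = 'emails: joe@gmail.com, bob@gmail.com'

# no groups: list of whole matches
no_groups = re.findall(r'[\w.-]+@[\w.-]+', s)        # ['joe@gmail.com', 'bob@gmail.com']

# one group: list of that group's text only
one_group = re.findall(r'([\w.-]+)@[\w.-]+', s)      # ['joe', 'bob']

# two groups: list of tuples, one tuple per match
two_groups = re.findall(r'([\w.-]+)@([\w.-]+)', s)   # [('joe', 'gmail.com'), ('bob', 'gmail.com')]
```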
'''
Option Flags:
Option flags modify the behavior of the pattern matching
default: matching is case sensitive
re.IGNORECASE: ignore uppercase/lowercase differences ('a' matches 'a' or 'A')
default: period matches any character except newline
re.DOTALL: allow period to match newline
default: within a string of many lines, ^ and $ match start and end of entire string
re.MULTILINE: allow ^ and $ to match start and end of each line
Option flag is third argument to re.search() or re.findall():
re.search(r'pattern', string_to_search, re.IGNORECASE)
re.findall(r'pattern', string_to_search, re.IGNORECASE)
'''
s = 'emails: nicole@ga.co, joe@gmail.com, PAT@GA.CO' # hypothetical sample addresses
re.findall(r'\w+@ga\.co', s) # ['nicole@ga.co']
re.findall(r'\w+@ga\.co', s, re.IGNORECASE) # ['nicole@ga.co', 'PAT@GA.CO']
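'''
re.DOTALL and re.MULTILINE (sketch):
These two flags are described above but not demonstrated
The two-line sample text and the variable names below are assumptions made for illustration
Flags can be combined with the | operator
'''
```python
import re

text = 'First line\nSecond line'   # hypothetical two-line sample

# default: period does not match the newline, so there is no match
no_dotall = re.search(r'line.Second', text)                        # None
# re.DOTALL: period may match the newline
with_dotall = re.search(r'line.Second', text, re.DOTALL).group()   # 'line\nSecond'

# default: ^ matches only at the start of the entire string
starts_default = re.findall(r'^\w+', text)                         # ['First']
# re.MULTILINE: ^ matches at the start of each line
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)         # ['First', 'Second']

# combining flags with |
combined = re.findall(r'^second', text, re.MULTILINE | re.IGNORECASE)   # ['Second']
```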
'''
Substitution:
re.sub() finds all matches and replaces them with a specified string
new_string = re.sub(r'pattern', r'replacement', string_to_search)
Replacement string can refer to text from matching groups:
\1 refers to group(1)
\2 refers to group(2)
etc.
'''
s = 'sid is missing class'
re.sub(r'is ', r'was ', s) # 'sid was missing class'
s = 'emails: joe@gmail.com, bob@gmail.com'
re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@example.com', s) # 'emails: joe@example.com, bob@example.com' (example.com is a placeholder domain)
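'''
Group References in Replacements (sketch):
\1 and \2 can both appear in the replacement, in any order
\g<1> is an alternative spelling of \1 that stays unambiguous when a digit follows the reference
The sample string and the replacement formats below are illustrative only
'''
```python
import re

s = 'emails: joe@gmail.com, bob@gmail.com'
pattern = r'([\w.-]+)@([\w.-]+)'

# use both groups, in reverse order
swapped = re.sub(pattern, r'\2 (user \1)', s)
# 'emails: gmail.com (user joe), gmail.com (user bob)'

# \g<1> followed by digits: r'\199' would be read as a reference to group 199
numbered = re.sub(pattern, r'\g<1>99@\2', s)
# 'emails: joe99@gmail.com, bob99@gmail.com'
```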
'''
Useful to know, but not covered above:
re.split() splits a string by the occurrences of a pattern
re.compile() compiles a pattern (for improved performance if it's used many times)
A|B indicates a pattern that can match A or B
'''
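'''
re.split(), re.compile(), and A|B (sketch):
Short examples of the three items listed above
The sample strings and variable names are assumptions made for illustration
'''
```python
import re

# re.split(): split on a comma or semicolon, plus any whitespace after it
parts = re.split(r'[,;]\s*', 'red, green;blue')      # ['red', 'green', 'blue']

# re.compile(): build the pattern once, then call methods on the compiled object
word = re.compile(r'\w+')
words = word.findall('sid is missing class')         # ['sid', 'is', 'missing', 'class']

# A|B: match either alternative
pets = re.findall(r'cat|dog', 'cat, dog, bird')      # ['cat', 'dog']

# (?:...) limits the alternation to part of the pattern without creating a group
grey = re.search(r'gr(?:a|e)y', 'grey').group()      # 'grey'
```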