Documentation Notes
^^^^^^^^^^^^^^^^^^^
- Need to clarify fundamental difference between constants baked into code
and things that remain variable. (ISL parameters, symbolic shapes)
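One way to picture the distinction in the note above is a toy generator that either bakes a loop length into the emitted C-like source or leaves it as a runtime argument. This is a minimal sketch in plain Python and not loopy's code generator; `emit_scale_kernel` and the shape of its output are hypothetical.

```python
def emit_scale_kernel(n=None):
    """Return C-like source for y[i] = 2*x[i].

    n given  -> the trip count is a literal baked into the code.
    n absent -> the trip count stays a runtime argument
                (comparable to an ISL parameter or a symbolic shape).
    """
    if n is None:
        signature = "void scale(int n, const float *x, float *y)"
        bound = "n"
    else:
        signature = "void scale(const float *x, float *y)"
        bound = str(n)
    body = f"  for (int i = 0; i < {bound}; ++i)\n    y[i] = 2*x[i];"
    return f"{signature}\n{{\n{body}\n}}"


baked = emit_scale_kernel(16)   # changing 16 means regenerating the code
symbolic = emit_scale_kernel()  # one piece of code serves every n
```

The baked variant lets a compiler unroll or simplify around the literal, while the symbolic variant must be correct for all admissible values of `n`.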

Things to consider
^^^^^^^^^^^^^^^^^^
- Dependencies are pointwise for shared loop dimensions
and global over non-shared ones (between dependent and ancestor)
- Multiple insns could fight over which iname gets local axis 0
-> complicated optimization problem
- Every loop in loopy is opened at most once.
Too restrictive?
- Why do precomputes necessarily have to duplicate the inames?
-> because that would be necessary for a sequential prefetch
- Cannot do slab decomposition on inames that share a tag with
other inames
-> Is that reasonable?
- Entering a loop means:
  - setting up conditionals related to it (slabs/bounds)
  - allowing loops nested inside to depend on loop state
- Not using all hw loop dimensions causes an error, as
is the case for variant 3 in the rank_one test.
- Measure efficiency of corner cases
- Loopy as a data model for implementing custom rewritings
- We won't generate WAW barrier-needing dependencies
from one instruction to itself.
- Loopy is semi-interactive.
- Limitation: base index for parallel axes is 0.
- Dependency on order of operations is ill-formed
- Dependency on non-local global writes is ill-formed
- No substitution rules allowed on lhs of insns
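The notes above on slab decomposition and on setting up conditionals at loop entry can be pictured with a small model: a chunked loop is cut into a bulk slab, where every chunk is complete and no bounds check is needed, and a remainder slab that keeps the check. This is a plain-Python sketch under that assumption, not loopy's implementation; `slabs` and `run` are hypothetical names.

```python
def slabs(n, chunk):
    """Split range(n), processed in chunks of `chunk`, into
    (start, stop, needs_check) slabs."""
    full = (n // chunk) * chunk
    result = []
    if full > 0:
        result.append((0, full, False))  # bulk: every chunk is complete
    if full < n:
        result.append((full, n, True))   # remainder: partial chunk
    return result


def run(n, chunk):
    """Visit every index once; count how many bounds checks were executed."""
    visited = []
    checks = 0
    for start, stop, needs_check in slabs(n, chunk):
        for base in range(start, stop, chunk):
            for inner in range(chunk):
                i = base + inner
                if needs_check:
                    checks += 1
                    if i >= n:
                        continue
                visited.append(i)
    return visited, checks


visited, checks = run(10, 4)
```

With `n = 10` and `chunk = 4`, only the last chunk pays for the conditional; an undecomposed loop would test the bound in all three chunks.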

To-do
^^^^^
- Kernel fusion
- when are link_inames, duplicate_inames safe?
- rename IndexTag -> InameTag
- Data implementation tags
- turn base_indices into offset
- vectorization
- write_image()
- change_arg_to_image (test!)
- Make tests run on GPUs
- Test array access with modulo
- Derive all errors from central hierarchy
- Provide context for more errors?
- Allow mixing computed and stored strides
Fixes:
- applied_iname_rewrites tracking for prefetch footprints isn't bulletproof:
old inames may still be around, so the rewrite may or may not have to be
applied.
- Group instructions by dependency/inames for scheduling, to
increase scheduler scalability
- What if no universally valid precompute base index expression is found?
(test_intel_matrix_mul with n = 6*16, e.g.?)
- If finding a maximum proves troublesome, move parameters into the domain

Future ideas
^^^^^^^^^^^^
- subtract_domain_lower_bound
- Storage sharing for temporaries?
- Kernel splitting (via what variables get computed in a kernel)
- Put all OpenCL functions into mangler
- Fuse: store/fetch elimination?
- Array language
- reg rolling
- When duplicating inames, use iname aliases to relieve burden on isl
- (Web) UI
- Check for unordered (no-dependency) writes to the same location
- Vanilla C string instructions?
- Barriers for data exchanged via global vars?
- Float4 joining on fetch/store?
- Better for loop bound generation
-> Try a triangular loop
- Eliminate the first (pre-)barrier in a loop.
- Generate automatic test against sequential code.
- Reason about generated code, give user feedback on potential
improvements.
- Convolutions, Stencils
- DMA engine threads?
- Try, fix indirect addressing
- Nested slab decomposition (in conjunction with conditional hoisting) could
generate nested conditional code.
- Better code for strides.
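For the triangular-loop item above, the point of interest is an inner bound that depends on the outer index. The sketch below, in plain Python with hypothetical function names, contrasts tight triangular bounds with a rectangular bounding box plus a guard; both visit the same points, but the guarded form runs more iterations and tests.

```python
def triangular_pairs(n):
    # Tight bounds: 0 <= j <= i < n, inner bound depends on i.
    return [(i, j) for i in range(n) for j in range(i + 1)]


def triangular_pairs_guarded(n):
    # Rectangular bounds plus a conditional: n*n iterations, same result.
    return [(i, j) for i in range(n) for j in range(n) if j <= i]


tight = triangular_pairs(4)
guarded = triangular_pairs_guarded(4)
```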

Dealt with
^^^^^^^^^^
- How can one automatically generate something like microblocks?
-> Some sort of axis-adding transform?
- RuleAwareIdentityMapper
  extract_subst -> needs WalkMapper [actually fine as is]
  padding [DONE]
  replace make_unique_var_name [DONE]
  join_inames [DONE]
  duplicate_inames [DONE]
  split_iname [DONE]
  CSE [DONE]
- rename iname
- delete unused inames
- Expose iname-duplicate-and-rename as a primitive.
- make sure simple side effects work
- Loop bounds currently may not depend on parallel dimensions
Does it make sense to relax this?
- Streamline argument specification
- syntax for linear array access
- Test divisibility constraints
- Test join_inames
- Divisibility, modulo, strides?
-> Tested, gives correct (but suboptimal) code.
- *_dimension -> *_iname
- Use gists (why do disjoint sets arise?)
- Automatically verify that all array access is within bounds.
- : (as in, Matlab full-slice) in prefetches
- Add dependencies after the fact
- Scalar insn priority
- ScalarArg is a bad name
-> renamed to ValueArg
- What to do about constants in codegen? (...f suffix, complex types)
-> dealt with by type contexts
- relating to Multi-Domain [DONE]
  - Reenable codegen sanity check. [DONE]
  - Incorporate loop-bound-mediated iname dependencies into domain
    parenthood. [DONE]
  - Make sure that variables that enter into loop bounds are only written
    exactly once. [DONE]
  - Make sure that loop bound writes are scheduled before the relevant
    loops. [DONE]
- add_prefetch tagging
- nbody GPU
-> pending better prefetch spec
- Prefetch by sample access
- How is intra-instruction ordering of ILP loops going to be determined?
(taking into account that it could vary even per-instruction?)
- Sharing of checks across ILP instances
- Differentiate ilp.unr from ilp.seq
- Allow complex-valued arithmetic, despite CL's best efforts.
- "No schedule found" debug help:
  - Find longest dead-end
  - Automatically report on what hinders progress there
- CSE should be more like variable assignment
- Deal with equality constraints.
(These arise, e.g., when partitioning a loop of length 16 into 16s.)
- dim_{min,max} caching
- Exhaust the search for a no-boost solution first, before looking
for a schedule with boosts.
- Pick not just axis 0, but all axes by lowest available stride
- Scheduler tries too many boostability-related options
- Automatically generate testing code vs. sequential.
- If isl can prove that all operands are positive, may use '/' instead of
'floor_div'.
- For forced workgroup sizes: check that at least one iname
maps to them.
- variable shuffle detection
-> will need unification
- Dimension joining
- user interface for dim length prescription
- Restrict-to-sequential and tagging have nothing to do with each other.
-> Removed SequentialTag and turned it into a separate computed kernel
property.
- Just touching a variable written to by a non-idempotent
instruction makes that instruction also not idempotent
-> Idempotent renamed to boostable.
-> Done.
- Give the user control over which reduction inames are
duplicated.
- assert dependencies <= parent_inames in loopy/__init__.py
-> Yes, this must be the case.
-> If you include reduction inames.
- Give a good error message if a parameter assignment in get_problems()
is missing.
- Slab decomposition for ILP
-> I don't think that's possible.
- Error messages that refer to instructions generated during
preprocessing are hard to understand.
-> Expose preprocessing to the user so she can inspect the preprocessed
kernel.
- Which variables need to be duplicated for ILP?
-> Only reduction
- implemented_domain may end up being smaller than requested in cse
evaluations--check that!
- Allow prioritization of loops in scheduling.
- Make axpy better.
- Screwy lower bounds in slab decomposition
- reimplement add_prefetch
- Flag, exploit idempotence
- Some things involving CSEs might be impossible to schedule
a[i,j] = cse(b[i]) * cse(c[j])
- Be smarter about automatic local axis choice
-> What if we run out of axes?
- Implement condition hoisting
(needed, e.g., by slab decomposition)
- Check for non-use of hardware axes
- Slab decomposition for parallel dimensions
  - implement at the outermost nesting level regardless
  - bound *all* tagged inames
  - can't slab inames that share tags with other inames (for now)
- Make syntax for iname dependencies
- make syntax for insn dependencies
- Implement get_problems()
- CSE iname duplication might be unnecessary?
(don't think so: It might be desired to do a full fetch before a mxm k loop
even if that requires going iterative.)
- Reduction needs to know a neutral element
- Types of reduction variables?
- Generalize reduction to be over multiple variables
- duplicate_dimensions can be implemented without having to muck around
  with individual constraints:
  - add_dims
  - move_dims
  - intersect
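The earlier item in this list about using '/' instead of 'floor_div' rests on a simple fact: C's integer '/' truncates toward zero, floor division rounds toward negative infinity, and the two agree whenever both operands are positive. A minimal check in plain Python follows; `c_div` is a hypothetical helper that models C's behavior.

```python
def c_div(a, b):
    """Integer division as C's '/' performs it: truncate toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def floor_div(a, b):
    """Floor division: round toward negative infinity."""
    return a // b


# Positive operands: both operations agree, so '/' is safe.
agree_on_positive = all(
    c_div(a, b) == floor_div(a, b)
    for a in range(1, 50)
    for b in range(1, 50)
)

# Mixed signs: they differ, which is why the proof of positivity matters.
truncated = c_div(-7, 2)
floored = floor_div(-7, 2)
```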

Should a dependency on an iname be forced in a CSE?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Local var:

l  | n
g  | y
dl | Err
d  | Err

Private var:

l  | y
g  | y
dl | Err
d  | Err

dg: Invalid -> error

d: is duplicate
l: is tagged as local idx
g: is tagged as group idx

Raise error if dl is targeting a private variable, regardless of whether it's
a dependency or not.
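
The table above can be read as a lookup, sketched here in plain Python. Several points are assumptions of this sketch rather than statements from the memo: "y"/"n" are taken to mean "force the dependency" or not, "dl" and "dg" are read as a duplicate iname tagged local or group, and every "Err" entry is modeled as a raised `ValueError`. The names `FORCE_DEPENDENCY`, `INVALID` and `force_dependency` are hypothetical.

```python
# Rows: kind of variable the CSE is stored in. Columns: kind of iname.
FORCE_DEPENDENCY = {
    "local":   {"l": False, "g": True},
    "private": {"l": True,  "g": True},
}

# Combinations the table marks as errors (assumed reading of d, dl, dg).
INVALID = {"d", "dl", "dg"}


def force_dependency(var_kind, iname_kind):
    """Return True if the dependency on the iname should be forced."""
    if iname_kind in INVALID:
        raise ValueError(
            f"iname kind {iname_kind!r} is invalid for a {var_kind} variable"
        )
    return FORCE_DEPENDENCY[var_kind][iname_kind]


local_on_local = force_dependency("local", "l")
private_on_local = force_dependency("private", "l")
```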