forked from fastflow/fastflow
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Changelog.txt
279 lines (278 loc) · 15 KB
/
Changelog.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
* v.3.0.0
* v.2.2 (TAG v.2.2)
- main developemnt repository migrated to GitHub, the sourceforge SVN
will be no longer updated. GOTO https://github.com/fastflow/fastflow
- bugfix
- Added TAG v.2.2
* v.2.1.3
- introduced a the possibility to add a "manager channel" in the farm emitter for reconfiguring farm's workers (see tests/test_multi_output5.cpp)
- the initial barrier can be avoided by undefining the FF_INITIAL_BARRIER define in config.hpp
- added makeFarm, makePipe, makeMap, makeMasterWorker, ...
- move constructor added in several classes (yet to be completed)
- Divide&Conquer pattern implementation (DAC)
- ffStats produces energy consumption statistics by using Mammut (MAchine Micro Management UTilities -- http://danieledesensi.github.io/mammut/)
- multi-input/output node management improved
- improved REPARA layer
* v.2.1.2
- possibility to switch between BLOCKING and NON-BLOCKING mode at run-time
- minor improvements for the OpenCL support
- minor enhancement of the fastflow statistics
- improved REPARA KERNEL LAYER support
* v.2.1.1 (not released)
- minor fixes
* v.2.1.0
- improved cmake compilation
- ff_tpcnode and ff_tpcallocator added
- added FPGA and DSP support through the ThreadPoolComposer (TPC) framework
developed by Andreas Koch's group at Technical University of Darmstadt (TUD).
* v.2.0.9 (not released)
- removed almost all explicit atomic operations based on assembler code and added
support to C++11 atomic operations
- nodeInit and nodeEnd added in the ff_node class as virtual methods
- added ff_selectNode, a simple pattern for selecting between different ff_node
- stencilReduceOCL used for map, reduce and map-reduce
- OpenCL support completely re-implemented
* 10/06/15 - v.2.0.8
- added BLOCKING_MODE compile time optione. Now it is possible to select
a blocking run-time. The default mode is non-blocking.
* - v.2.0.7 (not released)
- new high level interface for pipeline and farm: ff_Pipe, ff_Farm
- revised ff_taskf pattern
- added some new tests
* 15/11/14 - v.2.0.6
- removed warnings when compiling with clang
- fixed a problem in the ff_mapOCL
- introduced "explicit" for some constructors in order to prevent the compiler
to use implicit conversions
- fixed problems in the fftree usage
* 22/08/14 - v.2.0.5
- added the possibility to select the spinBarrier barrier in the parallel-for pattern (by default is enabled blocking barrier)
- revised version of OpenCL patterns (still experimental)
- OpenCL map/reduce/mapreduce patters are now in the mapOCL.hpp file
- spinBarrier::barrierSetup revised
- *** changed 'thaw' semantic ***.
If the function is called and the current thread is not in the frozen
state, then the thread will not be put to sleep when it reaches
the next freezing point.
- introduced the ff_taskf pattern (first non-recursive version)
- moved some code from mdf.hpp to tasks_internals.hpp
- if set_scheduling_ondemand is called with 0 as parameter then the
scheduling policy is NOT set to ondemand (i.e. the value should be >=1).
- introduced ff_minode_t and ff_monode_t, changed the interface (now typed)
of the ff_Map<S,T>
- introduced ff_node_t a wrapper template class around the ff_node class
- the ff_send_out now works also inside the Emitter and Collector of the
ordered farm (ofarm)
- removed FF_PIPE macros
- if spinWait is set to true in the parallel for, than spinBarrier is used as loop barrier
- removed barrier implementation from node.hpp and created a new file barrier.hpp
- fixed a bug in the init_data_static function (ParallelFor related)
- fixed some problems appearing when remove_collector is called
- an ff_minode can be used in place of the ff_node (i.e. when there is no need to
use an ff_minode, indeed)
- first version of the ParallelForPipeReduce pattern
- fixed problem with run_then_freeze for the ff_minode and ff_monode
* 24/01/14 - v.2.0.4
- poolEvolution interface improved
- introduced the threadPause() method in the ParallelFor/Reduce patterns.
It may be used to put threads to sleep when the spinwait flag is set
and the parallel for is not used for a while
- separated parallel_for_internals.hpp and parallel_for.hpp
- added the possibility to control the grain size in the static scheduling
policy of the ParallelFor(Reduce) pattern(s)
- ff_node::eosnotify interface changed:
virtual void eosnotify(int id=-1) --> virtual void eosnotify(ssize_t id=-1)
- added the poolEvolutionCUDA pattern with some simple tests (beta version)
- added the poolEvolution pattern with some simple tests (beta version)
- stencilReduceCUDA added
- moved stencil.hpp to stencilReduce.hpp
* 27/01/14 - v.2.0.3
- introduced GO_OUT: a new reserved message (see config.hpp)
- optimized ParallelFor and ParallelReduce computation when executed inside
a serial loop
- changed the Barrier implementation (spinBarrier still the same)
- added the ff_realNumCores() and the ff_numSockets() utility functions
- a new lock-free task scheduling for the ParallelFor has been implemented.
The new scheduling policy is used when the scheduler thread (i.e. farm's emitter
thread) is not being started. This scheduling may be enabled at compile time
by compiling with the flag -DNO_PARFOR_SCHEDULER_THREAD or at run-time by
calling ParallelFor::disableScheduler.
- added atomic_flag-based spin-lock (only for C++11-compliant compilers)
- defined ParallelFor and ParallelForReduce classes
- removed initial underscore in the preprocessor conditional guards in
order to be compliant to the C++ standard
- more work on parallel_for pattern to avoid starting the scheduler thread
when there are not enough cores
- fixed bug in the forall_Scheduler::init_data method
- added the possibility to freeze and thaw all farm threads and then
restart a lower number of workers in a farm having a collector thread.
(see tests/test_stopstartall.cpp)
- fixed an issue in the ofarm skeleton
- ParallelForReduce optimized for the case when it is called multiple times
in a sequential loop
- fixed possible node termination problem (in case the freezing flag was set)
- commented out the optimization heuristic in the parallel_for pattern
since it does not work well with high unbalanced loop iterations
- parallel_for/reduce: fixed some issues and optimized the case when it is
used with just 1 thread
- ff_send_out_to modified in order to better work with ondemand scheduling
policy
- changed some int types with size_t, moved to svector whenever possible
- more possible combinations of farm, pipeline and feedback using
ff_minode and ff_monode
- better multi-input and multi-output support (ff_minode ff_monode)
- ff_buffernode introduced
- more low-level examples showing how to build "complex" graph of nodes
- better estimation of the bounded and unbounded SPSC queue length.
Thanks to M. Moniruzzaman ([email protected]) for the patch.
- fixed some problems with the current version in the SWPS3 example
(thanks to Maurizio Drocco)
- introduced the ff_mdf/stencil2D/ff_graphsearch patterns (experimental)
- fixed a bug in the pipeline when used as an accelerator
* 08/09/13 - v.2.0.2
- added the possibility to freeze and thaw one single worker thread
in a task-farm skeleton (see tests/test_stopstartthreads2.cpp)
- added end_callback in case an external tool wants to be notified
of the skeleton tree termination
- introduced FF_ESAVER (Energy SAVER) compiler flag (experimental) (thanks to Mehdi Goli )
- ff_relax now uses nanosleep instead of usleep
- first full porting to ARM processor (thanks to Mauro Mulatero)
- added new interfaces for the ff_farm
- added a new construct ff_pipe only for C++11
- fixed some problems with cmake
- added ff_send_out_to as a method of the ff_loadbalancer class
- added parallel_for.hpp which implements 'OpenMP-like' parallel_for
- added cleanup_workers method in the farm skeleton
- removed fallback function in the farm emitter
- removed many warnings when compiling with the Intel compiler
- fixed get_my_id() problem for pipeline
* 23/12/13 - v.2.0.1
- improved cmake compilation
- added experimental OFED code
- added some layer1/2 tests
- map pattern using CUDA: ff_mapCUDA
- map and reduce patterns using OpenCL: ff_mapOCL and ff_reduceOCL
- Added multi-input node: ff_minode class implemented
- Added multi-output node: ff_monode class inplemented
- Distributed version ported and tested on Win7 64-bit.
- little fix on allocator.hpp, freesegment now release aligned memory
to avoid stack corruption on Win platform that requires symmetric
primitives for aligned memory allocation and deallocation.
- Winsock.h (windows.h) and winsock2.h incompatibility issue partially resolved.
<ff/dnode.h> should be included before other includes to avoid the problem. It
includes winsock2.h that is required by zeromq.
* 04/07/12 - v.2.0.0
- fixed compilation ploblems on Windows 7 64bit using Visual Studio 10
(cmake -G "Visual Studio 10 Win64" ...)
- added class ff_dinout that allows to have a single dnode with both
input and output external channels.
- added the class ff_ofarm to implement the ordered farm template
- NO_DEFAULT_MAPPING compilation flag added
- added the map template that can be used to compute map, map-reduce
and data-parallel patterns.
- Added the possibility to have multi-input Emitter in a farm template
(useful to collapse in a single thread the Collector and the Emitter
in a 2-stage pipeline of 2 farms).
- Changed the thread start order in the farm skeleton. The old one was:
collector (if present), all workers, emitter; now it is: emitter,
all workers, and finally the collector (if present).
- Added the class threadMapper for implementing threads mapping.
- Implemented the spinBarrier which has a lower overhead when the
number of thread is high.
- Fixed problems in ffTime and ffwTime for application with nested patterns.
- The distributed communication patterns implemented so far are:
Unicast, Broadcast, Scatter, Gather_All, FromAny, OnDemand
- first distributed implementation based on ZeroMQ message library
- dnode class added, minor changes to node class
* not-released - v.1.2.0
- added a new implementation of the unbounded MPMC queue
- added an implementation of the bounded Multi-Producer/Multi-Consumer
queue by Dmitry Vyukov (www.1024cores.net)
- implemented all_gather
- moved from int to unsigned long for embedded atomic operations
- performance improvements in the ff_allocator for the test one to many
- Micheal Scott 2-locks MPMC queue implementation (two methods mp_push
and mp_pop have been added to the dSPSC queue)
* 22/04/11 - v.1.1.1
- accelerator support implemented for the pipeline
- ff_send_out improvment: more control over the number of retries, return
value added
- barrier protocol re-worked, now is possible to create and to start a
pipeline or a farm (or an accelerator) from inside a pipeline's node or
a farm's worker/emitter/collector
- added ffwTime which considers only the time spent in the svc methods
- changed the value of FF_GO_ON
- fixed SWSR_Ptr_Buffer length method (unsigned long --> long)
- fixed a bug which prevents to have a farm whose workers are D&C skeleton
(see test_multi_masterworker test)
- added the possibility to define workers' input queue size for the
on-demand scheduling policy (default value is 1 slot)
* 22/04/11 - v.1.1.0
- MSqueue reworked, now it is not in the experimental state.
The queue is working also on Windows OS and OSX. Removed DCAS and used CAS.
- removed some performance problems on Windows OS.
- slightly modified MPMC interface and usage
- added Quicksort algo implementation which uses MPMC queue
(both in blocking and non-blocking version)
- fixed some compilation problem under OSx
* 01/04/11 - v.1.0.9
- fixes return type error in node.hpp ubuffer.hpp and buffer.hpp
- added the method get_channel_id() in the emitter and collector class
the method returns the id of the worker thread from which an input task
is coming from
- added the fibonacci and the quicksort examples
- added basic interface for pinning threads to CPUs
- added multipush and mpush to SWSR buffer (experimental code)
- added mpush to uSWSR queue (experimental code)
- more tests
- improved cmake test
- improved performance for fine grain D&C computation (farm+feedback)
by changing the losetime_in method in the lb class.
- added Multi-Producers/Multi-Consumers queue (MSqueue by Michael and Scott)
(EXPERIMENTAL code )
- added abstraction_dcas from liblfds (www.liblfds.org), no powerpc support yet!
- fixed some bugs in the allocator
- posix_memalign implemented in the allocator
- ticket-spinlock from Linux kernel (EXPERIMENTAL code only for Linux)
- added two lock-based methods in uSWSR queue (mp_push and mc_pop)
- fixed missing files in CMakeLists.txt
- fixed deadlock situation when pipeline with feedback channel is created (torus pattern)
- implemented a Deferred Reclamation strategy based on number of free in the allocator
- added the ff_queue implementation of a SPSC queue by Dmitry Vyukov (www.1024cores.net)
- first porting on Windows OS
* 01/09/10 - v.1.0.0
- New release
- some bugs fixed
- unbounded SWSR queue improved (removed all locks)
- added dynqueue (dynamic list-based queue)
- all .hpp files moved into the ff directory
- more tests added
- cmake compilation support (thanks to Fedor Sakharov)
- improved the accelerator sturcture (added FF_EOS_NOFREEZE tag)
- added the management of second level streams
- added the 'stop' method in the farm and pipeline skeleton
- fixed ffStats method when run_then_freeze is called multiple times
- one memory leak removed
- removed some warnings related to strict-aliasing
* 22/03/10 - v.1.0.0rc2
- Moved to LGPLv3 license.
- More tests and applications (including pbzip2).
- FastFlow Allocator improved.
- Fixed some minor bugs.
- Added the broadcast_task method.
* 03/02/10 - v.1.0.0rc1
- Minor API revision.
- Improved FastFlow accelerator support.
- More tests and applications,(Nokia QT Mandelbrot and NQueens)
- Simple execution trace support.
- Allocator improved.
* 16/11/09 - v.0.9.7
- Major API revision: patterns are no longer object factories but objects.
- Support for arbitrary nesting of pipe, farm, and loop at high-level layer.
- First version of the Divide&Conquer pattern (no examples yet).
- First version of the FastFlow's Accelerator.
* 19/10/09 - v.0.6.1
- FastFlow-swps3 Smith-Waterman application added.
* 15/10/09 - v.0.5.0
- First release.
- Support for farm skeleton/pattern.