TITLE
Parsing of Long Words
APPLICATION
mg-1, mg-2
TYPE
bug
REPORT
[email protected] - May 11th 1994
FIX
Tim Shimmin (no longer emailable) - August 9th 1994
CLAIM
Mg didn't handle long words properly; it crashed.
PROBLEM
The invf passes call PARSE_LONG_WORD [words.h], which uses a limit of
MAXLONGWORD when iterating through the string and storing into
a word. MAXLONGWORD = 8192.
However, mg strings generally store the length in the first
byte, limiting them to 255 characters. The word which was passed
to PARSE_LONG_WORD was an allocated string of MAXSTEMLEN = 255,
which is as large as we should get anyway. Thus, when given
a word longer than 255 chars, PARSE_LONG_WORD would allow it
(being less than 8192) and would try storing beyond the array limit.
SOLUTION
The author can't remember why PARSE_LONG_WORD was used and what
the significance of MAXLONGWORD = 8192 is.
So PARSE_LONG_WORD has been changed to PARSE_STEM_WORD which
uses MAXSTEMLEN as its limit.
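A minimal sketch of the bounded parse, not the actual macro from words.h. The function name parse_stem_word, its signature, and the use of isalnum as the word test are assumptions for illustration; only the MAXSTEMLEN = 255 limit and the length-in-first-byte layout come from the entry above.

```c
#include <assert.h>
#include <ctype.h>

#define MAXSTEMLEN 255 /* largest length a one-byte prefix can hold */

/* Hypothetical stand-in for PARSE_STEM_WORD: copies at most MAXSTEMLEN
 * alphanumeric characters from [s, end) into word[1..], stores the
 * count in word[0], and returns a pointer to the first unread byte.
 * word must have room for MAXSTEMLEN + 1 bytes. */
static const unsigned char *
parse_stem_word(unsigned char *word, const unsigned char *s,
                const unsigned char *end)
{
  int len = 0;

  assert(s <= end);
  while (s < end && len < MAXSTEMLEN && isalnum(*s))
    word[++len] = *s++;
  word[0] = (unsigned char) len;
  return s;
}
```

With a 400-character run of letters, the sketch stops after 255 characters and leaves the rest for the next call, so nothing is written past the end of the word buffer.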
FILES
* words.h
* invf.pass1.c
* invf.pass2.c
* ivf.pass1.c
* ivf.pass2.c
* query.ranked.c
*************************************************************
TITLE
Use of Lovins stemmer
APPLICATION
mg-1
TYPE
improve
REPORT
local - 1994
FIX
Linh Huynh (no longer emailable) - 1994
CLAIM
Stemming was done naively.
PROBLEM
Only a few types of words and their endings
were considered.
SOLUTION
Replacement with a more elaborate "known" stemmer by Lovins.
The algorithm is described in:
J.B. Lovins, "Development of a Stemming Algorithm",
Mechanical Translation and Computational Linguistics, Vol. 11, 1968.
FILES
* stem.c
* stem.h
*************************************************************
TITLE
Different term parsing
APPLICATION
mg-1
TYPE
bug
REPORT
Tim Shimmin (no longer emailable) - 23 Aug 1994
FIX
Tim Shimmin (no longer emailable) - 23 Aug 1994
CLAIM
Boolean queries did not extract words/terms using the
same method as is done at inverted-file creation and
as is used for rank query parsing.
PROBLEM
The hand-written lex. analyser, query_lex, which is called by
the boolean query parser was not calling a common
word-extraction routine as used by the rest of mg.
This would be OK if the two pieces of code did the same things, but they didn't.
Query_lex, for instance, did NOT place any limit on the
number of digits in a term.
Of even more concern, it would allow arbitrarily long words
although it used Pascal-style strings, which store the length
in the first byte and can therefore hold at most 255 characters.
SOLUTION
Query_lex in "query.bool.y", was modified to call the routine
PARSE_STEM_WORD which is also used by text-inversion routines and
ranking query routines.
Now all terms are extracted by the same routine.
To do this, the end of the line buffer had to be noted as
PARSE_STEM_WORD requires a pointer to the end - which is the
safe thing to do (don't want to run over the end).
This meant I had to find the length of the query line buffer.
This was allocated in the file "read_line.c" by the routine,
"readline". Its size was the literal number 1024.
This was changed to a constant and placed in "read_line.h".
The definition for PARSE_STEM_WORD can be found in "words.h".
FILES
* query.bool.y
* query.bool.c (by bison)
* read_line.c
* read_line.h
*************************************************************
TITLE
Highlighting of query terms
APPLICATION
mg-1
TYPE
extend
REPORT
Tim Shimmin (no longer emailable) - Aug 94
FIX
Tim Shimmin (no longer emailable) - Sep 94
CLAIM
Difficult to feel happy that the query-result returned is
satisfying the query - need to look hard to find the queried words.
Need to show words in results using some highlighting method.
PROBLEM
No highlighting of query terms in results.
SOLUTION
Mgquery was previously outputting the decompressed text to a pager
such as "less(1)" or "more(1)".
(Except when redirected or piped elsewhere :)
So what was needed was some sort of highlight pager that, instead of
just displaying the text, would also use some means of highlighting the
stemmed query words.
Two common forms of highlighting were chosen: underline and bolding.
These are supported by "less(1)" and possibly by "more(1)" by
using the backspace character.
A highlight pager will also need to know which words need to be
highlighted. Therefore, the code was modified to build up a
string of the stemmed query words for passing to the highlight pager.
Design Options:
---------------
* Could do text filtering in mgquery before passing out to pager.
Instead I pipe to a separate process, the "hilite_words" pager,
which filters and pipes into less/more.
* Could do different highlighting or a combination.
* Could use a different structure for storing the query words other
than the hash-table I used.
FILES
* Makefile - to include hilite_words target
* mg_hilite_words.c
* mgquery.c
* mgquery.1
* query.bool.y
* query.ranked.c
* environment.c
* environment.h
* backend.h
*************************************************************
TITLE
Mg_compression_dict did premature free
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - 23 Sep 94
FIX
[email protected] - 23 Sep 94
CLAIM
mg_compression_dict dumped core in
file: mg_compression_dict.c
function: Write_data
line: int codelen = hd->clens[i];
PROBLEM
Huffman data, hd, was freed *before* it was accessed again.
SOLUTION
The freeing of hd has been moved to after all accesses
(just before returning).
FILES
* mg_compression_dict.c
*************************************************************
TITLE
Boolean tree optimising rewrite
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - 23 Sep 94
FIX
Tim Shimmin (no longer emailable) - Oct 94
CLAIM
"I am still getting core dump in "and" queries in mgquery,
where the first word does not exist, but the second one does."
PROBLEM
Having freed a particular node, it tried to refree it and
access one of its fields.
I.e. code-fragment...
FreeNode(curr); /* where curr = CHILD(base) for 1st term in list */
FreeNodes(next);
FreeNodes(CHILD(base));
/* but CHILD(base) has already been freed above */
/* if the node was the first one in the list */
SOLUTION
A number of things in the code seemed a bit dubious to me.
So I have rewritten the boolean optimising stage and abstracted out
the various stages - each file starts with "bool".
Boolean query optimising seems to be a tricky problem.
It is not clear that putting an expression into a certain form will
actually simplify it and whether simplification means faster querying.
I have converted a given boolean expression into DNF
(Disjunctive Normal Form). "And not" nodes, which are readily apparent
in DNF, are converted to "diff" nodes. I have only applied the idempotency
laws involving TRUE and FALSE, and not the ones requiring matching of
expressions - it is a potentially more complicated problem.
The optimiser has been tested by playing with "bool_tester", and if you are
having a crash or problem in a boolean query it would be worth testing the
query on the "bool_tester." The token "*" stands for TRUE (or all documents)
and the token "_" stands for FALSE (or no documents). This should show the
expression before and after optimisation in an ASCII tree bracketing format.
FILES
* bool_tree.c
* bool_parser.y
* bool_optimiser.c
* bool_query.c
* bool_tester.c
* term_lists.c
*************************************************************
TITLE
Mgtic pixel placement
APPLICATION
mg-1
TYPE
bug
REPORT
Bruce McKenzie - [email protected] (21st Oct 1994)
FIX
CLAIM
mgtic crashed on certain files.
PROBLEM
Placing pixels outside of bitmap.
SOLUTION
Changed the putpixel routine to truncate at borders of the image.
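A minimal sketch of a bounds-checked pixel write. The Bitmap struct, the one-byte-per-pixel layout, and the name put_pixel are assumptions, and "truncate" is read here as dropping writes that fall outside the image; the real routine in mgtic.c may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-byte-per-pixel bitmap; mgtic's real layout may differ. */
typedef struct {
  int width, height;
  unsigned char *pixels; /* width * height bytes, row-major */
} Bitmap;

/* Sets pixel (x, y) and returns 1, or returns 0 without writing
 * when the coordinates fall outside the bitmap. */
static int
put_pixel(Bitmap *bm, int x, int y, unsigned char value)
{
  assert(bm != NULL && bm->pixels != NULL);
  if (x < 0 || y < 0 || x >= bm->width || y >= bm->height)
    return 0;
  bm->pixels[(size_t) y * bm->width + x] = value;
  return 1;
}
```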
FILES
* mgtic.c
*************************************************************
TITLE
Improved boolean tree optimising
APPLICATION
mg-1
TYPE
improve
REPORT
Tim Shimmin (no longer emailable) - 12/Dec/94
FIX
Tim Shimmin (no longer emailable) - 21/Dec/94, 14/Mar/95
CLAIM
Optimising by conversion to DNF is not necessarily such
a good idea - can actually slow things down.
PROBLEM
The distributive law used in converting to DNF
duplicates expressions.
SOLUTION
Introduce a query environment variable, optimise_type = 0 | 1 | 2.
Type 0 does nothing to the parse tree.
Type 2 does the DNF conversion.
Type 1 is the new default and does the following...
Do simple tree rearrangement like flattening.
Optimise for CNF queries.
FILES
* bool_query.c, .h
* bool_optimiser.c
* environment.c
* invf_get.c
* bool_tree.c, .h
* bool_tester.c
* lists.h
*************************************************************
TITLE
Similarity variants
APPLICATION
mg-2
TYPE
extend
REPORT
[email protected]/[email protected] - June 1994
FIX
Tim Shimmin (no longer emailable) - July 1994 .. Feb 1995
CLAIM
Can only use one type of similarity measure - the
standard cosine measure.
PROBLEM
See CITRI/TR-95-3 for more details.
The standard measure can be broken up into 7 components.
The 7 components are
Each one of these components has a number of alternatives.
The overall measure, S_qd, can also be altered.
Thus the particular similarity measure used can be specified
by an 8 dimensional vector.
What is desired is to be able to specify to mgquery an option
and a 8-digit string representing this vector (assuming that
any one alternative can have at most 9 (not using zero) variants).
SOLUTION
The programs which had to be modified were:
(i) mgquery,
(ii) mg_weights_build.
The other mg programs in existence store the text, indexing info,
and the basic statistics such as N, n, ft, fdt.
Other programs which had to be created were:
(i) mg_fmd_build,
(ii) mg_wt_build.
Mg_fmd_build will create the file to store the f_md statistic,
where f_md is the largest (maximum) f_dt of any term in document, d.
Mg_wt_build will create the file to store the w_t primitive.
It only creates this for the w_t variants 6-9 which would require
extra passes of invf at query time if they were not stored here.
For details on similarity changes for mgquery and mg_weights_build,
please see the other modification entries.
FILES
* mg_fmd_build.c
* mg_wt_build.c
* build_lib.c, build_lib.h
*************************************************************
TITLE
Similarity variants for mgquery
APPLICATION
mg-2
TYPE
extend
REPORT
[email protected]/[email protected] - June 1994
FIX
Tim Shimmin (no longer emailable) - July 1994 .. Feb 1995
CLAIM
"mgquery" needs to be altered to allow modification of
the similarity measure.
PROBLEM
See CITRI/TR-95-3 for more details.
SOLUTION
Most of the similarity measures, Sqd, are of the
form: Aqd
-----
Bqd
where Bqd is an expression involving Wd and possibly Wq,
where Aqd is a sum over the common document/query terms
of w_qt and w_dt.
Building of Aqd
===============
The calculation of Aqd is done in the file build_Aqd.c .
The functions for doing this used to be in the file invf_get.c .
Build_Aqd.c contains 4 different functions for building Aqd, each
of them building a different data structure:
(i) Array, (ii) Splay Tree, (iii) Hash Table, (iv) List Table.
Each of these routines seems to have been constructed by modifying
duplicated code. This is often the easiest way to construct variants,
but it makes consistency quite difficult to maintain.
As the aim of the exercise was to try out different sim. measures for
retrieval effectiveness, I only modified the code that constructed
an array. This routine was called "CosineDecode"; I changed it to
"build_Aqd_Array." This change reflects the fact that we are only
calculating Aqd and this need not be used for the Cosine measure.
The other routines: "CosineDecodeSplay," "CosineDecodeHash," and
"CosineDecodeList" have been left unaltered - they need to be updated
in the future, which would best be done by abstracting out common code.
By the stage of building Aqd, the query terms have been looked up in
the inverted file dictionary and put into a list.
This list of common terms is traversed to lookup the corresponding
invf entries. Before the invf entry is processed, all query and term
relevant statistics and primitive quantities are calculated.
For example, fqt, ft, wt, rqt, wqt, Wq-partial-sum.
To save unnecessary calculations, there is a test for each value
to see whether it is needed e.g. "if (sim->variant_needs & NEEDS_wt) ... ".
Aside: Variant Needs
--------------------
The idea behind the "variant_needs" field is to be able to have all
the code in the one place for each possible variant and this code would
get the information at the correct time/place only when it is needed.
The overhead is a "bit-and" and "test" for each component.
The important concern though, is that the "variant_needs" must be
accurate i.e. it should be carefully maintained.
Each possible need is stored as a bit position in a constant/macro
of the form "NEEDS_component" e.g. NEEDS_wt, NEEDS_nt
More recently (Jan/95), I have found that the _need_ing of a component
may be relevant to a particular purpose, that is, it may be needed in
one situation and not another. This was the case for Wd.
For rdt#6, Wd was needed, however, it might not be needed for Bqd, the
denominator of Sqd. So I changed from NEEDS_Wd to NEEDS_Wd_for_Sqd and
NEEDS_Wd_for_rdt.
--------------------
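The "variant_needs" test from the aside above can be sketched as follows. The bit positions, the SimSketch struct, and the placeholder formula 1.0 / ft are assumptions for illustration; the real definitions are in "similarity.h" and are not reproduced here.

```c
#include <assert.h>

/* Bit positions are illustrative; the real values live in similarity.h. */
#define NEEDS_wt          (1u << 0)
#define NEEDS_Wd_for_Sqd  (1u << 1)
#define NEEDS_Wd_for_rdt  (1u << 2)

typedef struct {
  unsigned variant_needs; /* OR of the NEEDS_* bits for this variant */
  double wt;
  int wt_computed;
} SimSketch;

/* Computes wt only when the selected variant asks for it.
 * The value 1.0 / ft is a placeholder, not one of mg's wt formulas. */
static void
prepare_term(SimSketch *sim, unsigned long ft)
{
  assert(ft > 0);
  if (sim->variant_needs & NEEDS_wt) {
    sim->wt = 1.0 / (double) ft;
    sim->wt_computed = 1;
  }
}
```

Splitting NEEDS_Wd into two bits, as described above, lets one purpose request Wd without forcing the other to load it.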
To handle the different calculations of the variants, I wrote macros
based around a switch statement. All these macros are stored in the file
"similarity.h" . The point of doing this is to centralise most of the changes
and cut down on the number of files which have to be altered if a new
way of calculating a primitive is to be added.
This is achieved by having a data record called "Similarity_variant" whose
fields include all the statistics and the similarity primitives.
So the standard procedure is to see if something is needed and if so, then
extract it from an mg file or calculate it using a macro - most of the
input and output to the macro is done via the "Similarity_variant" structure.
As well as calculating Aqd, it is also a convenient place to calculate Wq.
Previously, for Cosine measure, Wq was not needed because each Aqd was directly
divided by Wq and thus would not change the ordering. However, there are
some Sqd's which divide by a sum involving Wq and thus need Wq.
Calculating Sqd and Ordering Documents
======================================
The file query.ranked.c contains the code for calculating Sqd using the
approximate Wd in order to do the ranking of the documents.
The mg-1 file was cleaned up and modified slightly.
All the heap data structures and routines were taken out and placed in
their own file, heap_weights.c/.h .
The major components/steps in the ranking process were abstracted into
macros and functions:
calc_MaxParas
insert_heap_entry
insert_greater_heap_entry
approx_guided_insert
fill_initial_heap
add_heap_remainders
change_heap_to_exact
add_remainder_exact_weights
Make_Exact_Root
build_doc_list
Aside: Zeroing of an Aqd Element
-------------------------
One interesting change concerned the use of zeroing out an Aqd element
so as to mark it as being used in the heap.
In the heap, the approx. Sqd is stored based on Aqd/Wd-approx.
Later, when Aqd is required, it can be extracted from the approx. Sqd
by multiplication by Wd-approx, i.e. Aqd = Sqd-approx * Wd-approx.
This, however, is not always possible for the various Sqd variants.
So, instead of zeroing out Aqd, I decided just to make it negative.
-------------------------
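A minimal sketch of marking by sign rather than by zeroing, assuming that a live Aqd value is strictly positive and that the accumulators are a plain float array. The function names are hypothetical.

```c
#include <assert.h>

/* Marks accumulator slot doc as "in the heap" by flipping its sign.
 * Assumes a live Aqd value is strictly positive. */
static void
mark_in_heap(float *aqd, int doc)
{
  assert(aqd[doc] > 0.0f);
  aqd[doc] = -aqd[doc];
}

/* Returns nonzero if slot doc has been marked. */
static int
is_in_heap(const float *aqd, int doc)
{
  return aqd[doc] < 0.0f;
}

/* Recovers the original value whether or not the slot is marked. */
static float
aqd_value(const float *aqd, int doc)
{
  return aqd[doc] < 0.0f ? -aqd[doc] : aqd[doc];
}
```

Unlike zeroing, the sign flip keeps the magnitude, so Aqd stays available for Sqd variants where it cannot be recovered from the approximate Sqd.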
In some Sqd cases, Wd and Wd-approx. are not required. In which case,
step 3 which calls on "change_heap_to_exact", "Heap_Build" and
"add_remainder_exact_weights" is not required.
FILES
* build_Aqd.c
* query.ranked.c
* similarity.c/.h (in libmg)
* heap_weights.c/.h
* backend.c/.h
*************************************************************
TITLE
Similarity variants for mg_weights_build
APPLICATION
mg-2
TYPE
extend
REPORT
[email protected]/[email protected] - June 1994
FIX
Tim Shimmin (no longer emailable) - July 1994 .. Feb 1995
CLAIM
"mg_weights_build" needs to be altered to allow modification of
the similarity measure.
PROBLEM
See CITRI/TR-95-3 for more details.
SOLUTION
The weight files which are generated for a particular similarity
measure have their names extended by a suffix.
In the case of Wd#1 no weights are generated.
In the case of the standard cosine weights, Wd#2, a 3 letter suffix
is used to represent .
In the case of the other Wd variants, a one letter suffix is used
to represent which Wd variant it is.
In each case, the variant input (e.g. -q 22222222) should be the whole
similarity variant string and the relevant fields will be extracted out.
This is done for consistency in code and interface.
The code is fairly similar to the original.
A dependency check has been added so that the dates of files and the
type of needed files is verified before building.
The dependencies include, invf dictionary, invf index, invf, fmd and wt files.
The major change here, is the possible use of fmd and/or wt files.
Later when I was having to write mg_fmd_build and mg_wt_build,
I decided to abstract out some macros, namely:
Get_ft, Get_ft_Ft, loop_invf_entry,
which were put into "build_lib.c" .
"loop_invf_entry" takes a function/macro name as a parameter and applies
it to the sim record (with field fdt set), current doc number and modifies
the return value.
FILES
* mg_weights_build.c
* build_lib.c
*************************************************************
TITLE
Mgstat with non-existent files
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - 16 May 1994
FIX
Tim Shimmin (no longer emailable) - 10 Aug 1994
CLAIM
NaNs and Infinities would be printed out by mgstat
if unable to open .text or .text.dict file.
PROBLEM
The NaNs etc. were output in the column stating
the percentage size of the file compared with the
number of input bytes of the source text data.
If it couldn't read the .text file with its
header describing the number of source text bytes, then
in working out the percentage it would divide by zero.
Also due to some bad control flow, it wouldn't attempt to
open the .text file if it failed when opening
the .text.dict file.
SOLUTION
Only print out the percentage if we can read the header
from the .text file.
Read in text header irrespective of text dictionary file.
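A minimal sketch of the guarded percentage column. The function name format_percent and the dash placeholder are assumptions; the entry above only says that the percentage is skipped when the text header cannot be read.

```c
#include <assert.h>
#include <stdio.h>

/* Formats the size of a file as a percentage of the source text size.
 * When the text header could not be read, source_bytes is 0 and the
 * column is filled with a dash instead of dividing by zero. */
static void
format_percent(char *out, size_t out_size,
               unsigned long file_bytes, unsigned long source_bytes)
{
  assert(out != NULL && out_size > 0);
  if (source_bytes == 0)
    snprintf(out, out_size, "-");
  else
    snprintf(out, out_size, "%.2f%%", 100.0 * file_bytes / source_bytes);
}
```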
FILES
* mgstat.c
*************************************************************
TITLE
Boolean tree optimisations
APPLICATION
mg-2
TYPE
extend
REPORT
(i) Tim Shimmin (no longer emailable) - 28/Sep/94
(ii) Tim Shimmin (no longer emailable) - 12/Dec/94
FIX
(i) Tim Shimmin (no longer emailable) - 18/Oct/94
(ii) Tim Shimmin (no longer emailable) - 21/Dec/94
CLAIM
The initial prompt for investigating the optimisation of
boolean queries is noted in the mg-1 mod14.txt.
The code for optimising seemed to have a number of faults.
PROBLEM
Boolean optimisation was unreliable.
SOLUTION
Initially (in case (i) above, see mg-1/mod14.txt), I rewrote
all the boolean tree and optimising code. I converted the boolean
expression into DNF. I did this after reading some notes about
the steps involved in optimisation and they suggested standardising
in some normal form. I thought that DNF would be appropriate so that
all the terms are converted to be part of "and" expressions and be
evaluated quickly using skipping.
This, however, can suffer quite badly if the distributive law is
applied too often and the query expands in size. If there was
some sort of caching of invf entries, then it might not be so
bad; otherwise there is quite an overhead in reading the same
invf entry more than once.
As it happens, CNF queries are reasonably common, where the user
queries with a conjunction of disjunctions of synonyms:
e.g. (car | automobile | vehicle) & (fast | quick | speedy)
This sort of CNF query expands a hell of a lot !
So after speaking with Justin who wanted to benchmark Atlas with Mg on
these sort of queries, I looked up the MG book for other ideas.
The method that I implemented was the following:
-----------------------------------------------------------------
Steps of tree modifications:
Gets literals by pushing the nots down, detecting T/F at leaves
and collapsing the tree by detecting 'and' of 'and's and 'or' of
'or's.
Next it looks at the or nodes and if all the children are terms
then it marks the or-node as such.
Finally, the or-term-nodes are sorted by using the sum of their
ft's for comparison.
Steps at query evaluation:
If it comes across an 'and' of 'or-terms' then the evaluation is
done noting the distributive law.
I.e. a & (b | c | d) = (a & b) | (a & c) | (a & d)
Assuming 'a' is the c-set of documents.
All of 'a' is tested against 'b' and matching ones are marked.
Next, all the unmarked members of 'a' are tested against 'c'.
Likewise for 'd'.
Now all the marked members of our c-set are kept.
When we do the testing, we can use the skipping in the invf entries.
-----------------------------------------------------------------
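The query evaluation steps above can be sketched on sorted arrays of document numbers. This is a simplified illustration, not the code in invf_get.c: the function names are hypothetical, the lists are plain in-memory arrays, and a binary search stands in for skipping within a compressed invf entry.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if doc occurs in the ascending list[0..n). The binary
 * search stands in for skipping within an invf entry. */
static int
contains(const int *list, size_t n, int doc)
{
  size_t lo = 0, hi = n;

  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (list[mid] == doc)
      return 1;
    if (list[mid] < doc)
      lo = mid + 1;
    else
      hi = mid;
  }
  return 0;
}

/* Evaluates cset & (terms[0] | terms[1] | ...) in place: each term is
 * tested only against the still-unmarked candidates, then the marked
 * ones are compacted to the front. Returns the new candidate count.
 * marks must have room for n_cset entries. */
static size_t
and_of_or_terms(int *cset, size_t n_cset, unsigned char *marks,
                const int *const *terms, const size_t *term_len,
                size_t n_terms)
{
  size_t i, t, kept = 0;

  assert(cset != NULL && marks != NULL);
  for (i = 0; i < n_cset; i++)
    marks[i] = 0;
  for (t = 0; t < n_terms; t++)
    for (i = 0; i < n_cset; i++)
      if (!marks[i] && contains(terms[t], term_len[t], cset[i]))
        marks[i] = 1;
  for (i = 0; i < n_cset; i++)
    if (marks[i])
      cset[kept++] = cset[i];
  return kept;
}
```

Each invf entry is read once and the candidate set never grows, which is the point of evaluating the distributive law this way rather than expanding the query into DNF.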
After doing this, I added the choice of which type of optimisation
the user wanted by adding query-environment-variable, "optimise_type".
Type 0 = no parse tree modification.
Type 1 = Or-term recognition and CNF query evaluation optimisation.
Type 2 = Put into DNF form. [generally not recommended]
FILES
* bool_tree.c
* bool_parser.y
* bool_optimiser.c
* bool_query.c
* bool_tester.c
* term_lists.c
* query_env.c
* invf_get.c [GetDocsOp]
*************************************************************
TITLE
nonexistent HOME bug
APPLICATION
mg-1, mg-2
TYPE
bug
REPORT
[email protected] - 2/May/95
FIX
Tim Shimmin (no longer emailable) - 2/May/95
CLAIM
"The big problem was that mgquery crashes when the HOME environment
variable is not set, which is the case when it is run by the www server."
[...] "I expect it happens when looking for $HOME/.mgrc."
PROBLEM
The result of getenv("HOME") was used directly in
a sprintf call. If the environment variable HOME
was not set, then a null pointer would be passed.
In some C libraries sprintf will convert the null
pointer into the string "(null)"; on others it will core dump.
(For example, Solaris seems to core dump, sunos 4 seems ok).
SOLUTION
The result from getenv("HOME") is tested before
being used.
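A minimal sketch of the check, not the actual code in commands.c. The function name home_mgrc_path is hypothetical, and snprintf is used here instead of sprintf so that an overlong HOME is also rejected, which goes beyond the fix described above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Builds "$HOME/.mgrc" in path. Returns 1 on success, or 0 when HOME
 * is unset or the result would not fit, leaving path empty. */
static int
home_mgrc_path(char *path, size_t size)
{
  const char *home = getenv("HOME");
  int n;

  assert(path != NULL && size > 0);
  path[0] = '\0';
  if (home == NULL)
    return 0;
  n = snprintf(path, size, "%s/.mgrc", home);
  if (n < 0 || (size_t) n >= size) {
    path[0] = '\0';
    return 0;
  }
  return 1;
}
```

A caller would simply skip reading the rc file when the function returns 0.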
FILES
* commands.c
*************************************************************
TITLE
mgquery collection name preference
APPLICATION
mg-1, mg-2
TYPE
improve
REPORT
[email protected] - 2/May/95
FIX
Tim Shimmin (no longer emailable) - 4/May/95
CLAIM
Surely something must override mgquery's preference for ./bib.
If MGDATA is set correctly, I think it should prefer that collection,
and -d should definitely override it.
I could always say -d . if I really wanted ./bib.
PROBLEM
Currently the priority is:
1. Check if ./name is a directory,
If so then use it as the collection directory.
2. Check if ./name.text is a file,
If so then use ./ as the collection directory.
3. Check if mgdir/name is a directory,
If so then use mgdir/name as the collection directory.
4. Otherwise,
Use mgdir/name as the database file prefix.
This would be the case if one used "-f alice/alice".
However, one would then not specify a final name argument
and we'd never get here. Go figure ???
SOLUTION
Moved step 3 to the top instead.
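The reordered search can be sketched as a pure decision function. The enum, the function name, and the three flag parameters (which a caller would fill in from file-system checks) are hypothetical; only the order of the tests comes from the entry above.

```c
#include <assert.h>

typedef enum {
  USE_MGDIR_NAME_DIR, /* mgdir/name is the collection directory */
  USE_LOCAL_NAME_DIR, /* ./name is the collection directory */
  USE_LOCAL_DIR,      /* ./ is the collection directory */
  USE_MGDIR_PREFIX    /* mgdir/name is the database file prefix */
} CollectionChoice;

/* Applies the new priority: the mgdir/name test (old step 3) comes
 * first, followed by the two local tests and then the fallback. */
static CollectionChoice
choose_collection(int mgdir_name_is_dir, int local_name_is_dir,
                  int local_text_is_file)
{
  if (mgdir_name_is_dir)
    return USE_MGDIR_NAME_DIR;
  if (local_name_is_dir)
    return USE_LOCAL_NAME_DIR;
  if (local_text_is_file)
    return USE_LOCAL_DIR;
  return USE_MGDIR_PREFIX;
}
```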
FILES
* mgquery.c [search_for_collection()]
*************************************************************
TITLE
Printout of query terms
APPLICATION
mg-1, mg-2
TYPE
extend
REPORT
Tim Shimmin (no longer emailable) - April 95
FIX
Tim Shimmin (no longer emailable) - April 95
CLAIM
No easy way to find out the parsed and stemmed words
used in the query. Would like to know these words
so I can call a separate highlighting program to
highlight these words.
PROBLEM
No facility available.
SOLUTION
A ".queryterms" mgquery command was added which lists
out the parsed/stemmed queryterms of the last query.
FILES
* commands.c (added CmdQueryTerms)
*************************************************************
TITLE
mg_getrc
APPLICATION
mg-1, mg-2
TYPE
extend
REPORT
[email protected] - 2/May/95
FIX
-
CLAIM
Repeated code had to be written for different named
gets but really the same type of parsing required.
E.g. one might want to use a standard method for inserting
^Bs between paragraphs for different books. One doesn't
want to write duplicate code for each different named book,
rather note that each book should be filtered "book" style.
PROBLEM
There was no way of abstracting out types of filters from
the name of an instance of a collection.
SOLUTION
Allow information to be given with <name, type, files>.
This extra info can be provided in a mg_getrc file.
See man page for mg_get for details.
FILES
* mg_get.sh
*************************************************************
TITLE
TREC DocNo file
APPLICATION
mg-2
TYPE
improve
REPORT
Tim Shimmin (no longer emailable) - 26/May/95
FIX
Tim Shimmin (no longer emailable) - 26/May/95
CLAIM
MG has problems dealing with trec docnos for trec disk 3.
PROBLEM
Trec DocNos file didn't have a wide enough field
to handle disk 3.
SOLUTION
Allow different width fields for file.
It is still fixed width but a number in the header
says how wide the field is.
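A minimal sketch of a fixed-width record file whose width is stored in the header. The layout used here (a single header byte followed by the records, all held in memory) and the name get_docno are assumptions; the real DocNo file format is not described in this entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: file[0] is the field width, followed by
 * fixed-width records. Copies record number index into out as a
 * NUL-terminated string and returns 1, or returns 0 if the record
 * does not exist or does not fit in out. */
static int
get_docno(const unsigned char *file, size_t file_len, size_t index,
          char *out, size_t out_size)
{
  size_t width;

  assert(file != NULL && out != NULL);
  if (file_len < 1)
    return 0;
  width = file[0];
  if (width == 0 || width >= out_size)
    return 0;
  if (index >= (file_len - 1) / width)
    return 0;
  memcpy(out, file + 1 + index * width, width);
  out[width] = '\0';
  return 1;
}
```

Records stay directly addressable by document number, but a collection with longer identifiers only needs a larger width in the header.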
FILES
* passes/mg.special.c
* query/mgquery.c
*************************************************************
TITLE
Boolean optimiser #1 with `!'
APPLICATION
mg-1, mg-2
TYPE
bug
REPORT
[email protected] - 20/7/95
FIX
Tim Shimmin (no longer emailable) - 27/7/95
CLAIM
Complained about not-nodes.
e.g. complained about "croquet & !hedgehog"
PROBLEM
Boolean optimiser type#1 didn't convert
"and not"s into diff nodes.
SOLUTION
Added code to convert '&!' to '-'.
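A minimal sketch of the rewrite on a binary tree. The Node type and the function name are hypothetical, and mg's own boolean tree uses child lists rather than left/right pointers, so this only illustrates the shape of the transformation.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { N_TERM, N_AND, N_NOT, N_DIFF } Tag;

/* Hypothetical binary node; a not-node keeps its operand in left. */
typedef struct Node {
  Tag tag;
  struct Node *left, *right;
} Node;

/* Rewrites "a & !b" into "a - b" in place. Returns the detached
 * not-node so the caller can release it, or NULL if n does not have
 * the "and not" shape and was left unchanged. */
static Node *
and_not_to_diff(Node *n)
{
  Node *not_node;

  assert(n != NULL);
  if (n->tag != N_AND || n->right == NULL || n->right->tag != N_NOT)
    return NULL;
  not_node = n->right;
  n->tag = N_DIFF;
  n->right = not_node->left;
  not_node->left = NULL;
  return not_node;
}
```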
FILES
* mg/bool_optimiser.c [mg-1]
* query/bool_optimiser.c [mg-2]
*************************************************************
TITLE
Autoconfiguring mg-1
APPLICATION
mg-1
TYPE
improve
REPORT
many people - 94/95
FIX
Tim Shimmin (no longer emailable) - Aug/95
CLAIM
Portability is limited by setting up c-macros just for particular
machines and operating systems.
People had to make changes for HP, Next, Linux, Dec Alpha, ...
PROBLEM
Porting only targeted the machines that the author had
access to.
SOLUTION
Use GNU's autoconf program.
This allows checking of the systems features/characteristics.
It also allows some checking for specific machines/OS - although
I have not utilised this option.
I used GNU's tar-1.11.8 as an example to base my changes on.
I also used autoscan to generate the initial "configure.in".
The "Makefile.in"s were done very similarly to GNU tar's.
The "config.h" and "sysfuncs.h" files were scrapped and
rewritten. The new "config.h" file is generated by the configure
script - it contains all the #define's for the system features.
The "sysfuncs.h" file wraps up a number of system headers.
For example, some systems use , while some use ;
which one is included is decided in "sysfuncs.h".
I have also used GNU tar's use of ansi2knr in its Makefiles.
This should hopefully allow the package to work on a system with
only a K&R C compiler.
However, there are probably problems with what I have done.
I am concerned about <stdarg.h> for example.
I also noticed that "ansi2knr" requires function definitions as
the GNU coding style recommends, i.e. with the function name the first
string on the line. This prompted me to run all the package's code
through GNU's indent.
Setting up the configure changes is difficult. It really seems
necessary to try the package out on many target machines so one
can know what is necessary.
A simple check target for the main Makefile has been written.
It is used to see if the installation is working - it does
not test much of the functionality of mg.
It does cmp's on data files and diff's on query/result files.
FILES
Most of the files in the distribution.
*************************************************************
TITLE
Consistent use of stderr
APPLICATION
mg-1
TYPE
improve
REPORT
[email protected] - 16 May 1994
FIX
Tim Shimmin (no longer emailable) - 11 August 1994
CLAIM
Inconsistent use of stdout/stderr in usage messages.
PROBLEM
Sometimes used "printf" and sometimes used "fprintf(stderr"
in usage messages.
SOLUTION
All should now use "fprintf(stderr" in usage messages.
FILES
* mg_compression_dict.c
* mg_compression_dict.1
* mg_fast_comp_dict.c
* mg_fast_comp_dict.1
* mg_invf_dict.c
* mg_invf_dict.1
* mg_invf_dump.c
* mg_invf_dump.1
* mg_invf_rebuild.c
* mg_invf_rebuild.1
* mg_perf_hash_build.c
* mg_perf_hash_build.1
* mg_text_estimate.c
* mg_text_estimate.1
* mg_weights_build.c
* mg_weights_build.1
*************************************************************
TITLE
xmg bug
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - 22 April 1994
FIX
[email protected] - 22 April 1994
CLAIM
"Serious problem in xmg, which I fear occurs whenever a query
doesn't return anything."
PROBLEM
??
SOLUTION
[xmg.sh 201] set rank 0
FILES
* xmg.sh
*************************************************************
TITLE
Unnecessary loading of text
APPLICATION
mg-1
TYPE
bug
REPORT
Tim Shimmin (no longer emailable) - ?? August 1994
FIX
Tim Shimmin (no longer emailable) - 12 August 1994
CLAIM
Mg was loading and uncompressing text when the
query did not require the text.
PROBLEM
There was no test for the query mode
before loading and uncompressing the text.
SOLUTION
Only load/uncompress text if query mode
is for text, headers or silent (for timing).
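A sketch of such a test. The enum values, the stub loader and the function names are all hypothetical; mgquery.c uses its own names and data structures.

```c
#include <assert.h>

/* Hypothetical query modes. */
enum query_mode
{
  MODE_TEXT,      /* print document text */
  MODE_HEADERS,   /* print document headers */
  MODE_SILENT,    /* fetch but print nothing, for timing */
  MODE_DOCNUMS,   /* print document numbers only */
  MODE_COUNT      /* print the number of matches only */
};

static int texts_loaded = 0;   /* counts calls to the stub below */

/* Stub standing in for the expensive load/uncompress step. */
static void
load_and_uncompress_text (void)
{
  texts_loaded++;
}

/* Return nonzero if answering in this mode needs the text. */
int
mode_needs_text (enum query_mode mode)
{
  switch (mode)
    {
    case MODE_TEXT:
    case MODE_HEADERS:
    case MODE_SILENT:
      return 1;
    default:
      return 0;
    }
}

/* Test the mode first, and only then pay for the text. */
void
show_result (enum query_mode mode)
{
  if (mode_needs_text (mode))
    load_and_uncompress_text ();
}
```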
FILES
* mgquery.c
*************************************************************
TITLE
Man page errors
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - 16 May 1994
FIX
[email protected] - 16 May 1994
CLAIM
Man page errors.
PROBLEM
See below.
SOLUTION
"The mg_make_fast_dict.1 file has been renamed mg_fast_comp_dict.1,
and all mg_make_fast_dict strings changed to mg_fast_comp_dict in all
man pages.
A large number of errors of spelling, typography, spacing, fonts,
grammar, omitted words, slang, punctuation, missing man page
cross-references, and man-page style have been corrected."
FILES
* mg_compression_dict.1
* mg_fast_comp_dict.1
* mg_get.1
* mg_invf_dict.1
* mg_invf_dump.1
* mg_invf_rebuild.1
* mg_passes.1
* mg_perf_hash_build.1
* mg_text_estimate.1
* mg_weights_build.1
* mgbilevel.1
* mgbuild.1
* mgdictlist.1
* mgfelics.1
* mgquery.1
* mgstat.1
* mgtic.1
* mgticbuild.1
* mgticdump.1
* mgticprune.1
* mgticstat.1
* xmg.1
*************************************************************
TITLE
Man page overview
APPLICATION
mg-1
TYPE
extend
REPORT
FIX
Tim Shimmin (no longer emailable) - 17 August 1994
CLAIM
"Write new mg.1 file to give a brief overview of mg, with samples
of how to use it. Otherwise, users are likely to be completely
overwhelmed by the number of programs (about 20) which might need to
be used, when in reality, only 2 or 3 are likely to be run by end
users."
SOLUTION
It was thought that mg.1, written by Nelson Beebe, was very useful
but a bit too comprehensive for an introduction.
Therefore, two man files, mgintro.1 and mgintro++.1 were written
with the basic stuff in mgintro.1 and slightly more advanced stuff
in mgintro++.1.
FILES
* mg.1
* mgintro.1
* mgintro++.1
*************************************************************
TITLE
Parse errors not bus errors
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - 2 Jun 94
FIX
Tim Shimmin (no longer emailable) - 19 Aug 94
CLAIM
"These two queries
(which I typed in before I knew what I was doing!!)
> The Queen of Hearts, she made some tarts
> "The Queen of Hearts" and "she made some tarts"
produced the following result:
mgquery : parse error
Bus error
"
PROBLEM
What is expected to happen under boolean querying:
Query1:
> The Queen of Hearts, she made some tarts
will produce a parse error due to the comma, which
is not a valid TERM.
Query2:
> "The Queen of Hearts" and "she made some tarts"
will store a post-processing string
of ''The Queen of Hearts" and "she made some tarts'' and
will have a main boolean query of the empty string.
This is because the postprocessing string takes in
everything between the first quote and the last one.
An empty string is illegal for the boolean grammar and
hence causes a parse error.
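The first-quote-to-last-quote behaviour can be sketched as follows. The function split_post_string is hypothetical and only mimics the splitting described here; it is not the code mgquery uses.

```c
#include <assert.h>
#include <string.h>

/* Copy the text between the first and last double quote of input
   into post, and the text outside those quotes into rest.  Both
   buffers must hold at least strlen (input) + 1 bytes.
   Returns 1 if a quoted span was found, 0 otherwise. */
int
split_post_string (const char *input, char *post, char *rest)
{
  const char *first, *last;
  size_t head;

  assert (input != NULL && post != NULL && rest != NULL);
  first = strchr (input, '"');
  last = strrchr (input, '"');
  if (first == NULL || first == last)
    {
      /* No quotes, or only one: nothing to post-process. */
      post[0] = '\0';
      strcpy (rest, input);
      return 0;
    }
  /* Everything strictly between the outermost quotes. */
  memcpy (post, first + 1, (size_t) (last - first - 1));
  post[last - first - 1] = '\0';
  /* Text before the first quote plus text after the last one. */
  head = (size_t) (first - input);
  memcpy (rest, input, head);
  strcpy (rest + head, last + 1);
  return 1;
}
```

Applied to Query2, the whole line lies between the outermost quotes, so the remaining boolean query is empty.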
The problem stems from the fact that the processing of
the parse tree is carried out even though we have a
parse error. When an empty string is used to build
a parse tree, the parse tree is likely to be left undefined.
SOLUTION
As soon as we find out that there is a parse error,
we abandon any processing of the parse tree.
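A sketch of the guard, using a toy parser in place of the yacc-generated one. The names struct tree, parse_query and run_query are hypothetical stand-ins, not identifiers from query.bool.y.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the boolean parse tree. */
struct tree
{
  const char *term;
};

static struct tree node;

/* Toy parser: an empty query is a parse error and leaves *out
   untouched, mimicking a tree that was never built. */
static int
parse_query (const char *query, struct tree **out)
{
  if (query[0] == '\0')
    return -1;
  node.term = query;
  *out = &node;
  return 0;
}

/* Return 0 on success, -1 on parse error.  The tree is examined
   only once parsing is known to have succeeded. */
int
run_query (const char *query)
{
  struct tree *t = NULL;

  assert (query != NULL);
  if (parse_query (query, &t) != 0)
    return -1;            /* give up before touching the tree */
  return t->term[0] != '\0' ? 0 : -1;
}
```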
FILES
* query.bool.y
* query.bool.c (generated from query.bool.y)
*************************************************************
TITLE
Perfect hashing on small vocab
APPLICATION
mg-1
TYPE
bug
REPORT
Tim Shimmin (no longer emailable) - July 1994
FIX
[email protected] - July 1994
CLAIM