Add support for the indexed users db format #934
The indexed format is a tree-structured database. Each unique string is stored only once in the db and referenced through pointers by each dmr ID entry that uses that string. The new format uses about half the space of the standard md380 userdb format.
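To illustrate the idea of storing each unique string once and referencing it from every entry that uses it, here is a minimal sketch in C. It is not the layout described in README-INDEXEDDB.md: `string_pool`, `pool_intern`, and `user_entry` are hypothetical names, and a flat pool with a linear search stands in for the tree structure the PR actually uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POOL_SIZE 4096

/* Hypothetical string pool: NUL-terminated strings packed back to back. */
struct string_pool {
    char data[POOL_SIZE];
    uint32_t used;              /* number of bytes of data[] in use */
};

/* Return the offset of s in the pool, appending it only if it is new. */
uint32_t pool_intern(struct string_pool *p, const char *s)
{
    uint32_t off = 0;

    /* Walk the existing strings looking for an exact match. */
    while (off < p->used) {
        if (strcmp(&p->data[off], s) == 0)
            return off;
        off += (uint32_t)strlen(&p->data[off]) + 1;
    }

    /* Not found: off now equals p->used, so append there. */
    size_t len = strlen(s) + 1;
    assert(p->used + len <= POOL_SIZE);
    memcpy(&p->data[p->used], s, len);
    p->used += (uint32_t)len;
    return off;
}

/* Hypothetical entry: each field is an offset into the pool, not a copy. */
struct user_entry {
    uint32_t dmr_id;
    uint32_t callsign_off;
    uint32_t name_off;
    uint32_t city_off;
};
```

Two entries that share a city or a name end up holding the same offset, so the string's bytes are paid for only once.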
Hi Travis. Please review and comment. I've been running various iterations of this code and the new db format for a couple of months now without any issues. The changes I made to usersdb.c support both the standard userdb format and this new indexed tree-structured format. The new format also begins with a single ASCII line containing "0", so if the new format is installed on a radio running old firmware, it just looks like a 0-length database. Once support for the new db format is added to md380tools, the firmware will support either format. A description of the format is contained in README-INDEXEDDB.md. The repo at https://github.com/DaleFarnsworth/md380IndexedUserDB contains C programs that convert both ways between the standard db format and this new indexed format. The conversion back and forth is lossless. Thanks.
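The leading "0" line suggests one way a reader could tell the two formats apart. The sketch below is an assumption, not the logic in usersdb.c: it supposes that the first line of a standard db is a decimal size, and `classify_db` and the `db_kind` values are hypothetical names. What follows the "0" line in the indexed format is defined in README-INDEXEDDB.md and is not modeled here.

```c
#include <assert.h>
#include <stdlib.h>

enum db_kind {
    DB_INDEXED,     /* first line is "0": empty to old firmware */
    DB_STANDARD,    /* first line is a non-zero decimal size (assumed) */
    DB_INVALID      /* first line is not a plain decimal number */
};

/* Classify a users db by its first ASCII line. */
enum db_kind classify_db(const char *first_line)
{
    char *end;

    assert(first_line != NULL);
    unsigned long size = strtoul(first_line, &end, 10);

    if (end == first_line)                  /* no digits were consumed */
        return DB_INVALID;
    if (*end != '\n' && *end != '\0')       /* trailing junk on the line */
        return DB_INVALID;

    return size == 0 ? DB_INDEXED : DB_STANDARD;
}
```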
Any volunteers to review this code? At first glance it's a worthy contribution, but I'm too burned out on this project to review the code thoroughly myself.
There is a further compression which can be applied, as long as you only need upper- and lower-case ASCII letters, digits, space, and comma, because the total number of unique characters is then 64, not 256. Hence 4 ASCII bytes can be packed into 3 bytes, AFAIK. This is the compression method used by some manufacturers, such as Connect Systems.
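A minimal sketch of that 6-bit packing in C, assuming exactly the 64-symbol alphabet named in the comment (space, comma, digits, and upper- and lower-case letters). The symbol order and the names `ALPHABET`, `pack4`, and `unpack4` are assumptions for illustration; this is not Connect Systems' actual encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed 64-symbol alphabet: 2 + 10 + 26 + 26 characters. */
static const char ALPHABET[] =
    " ,0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz";

/* Pack 4 characters into 3 bytes (4 x 6 bits = 24 bits).
 * Returns 0 on success, -1 if a character is outside the alphabet. */
int pack4(const char in[4], uint8_t out[3])
{
    uint32_t bits = 0;

    assert(sizeof(ALPHABET) - 1 == 64);
    for (int i = 0; i < 4; i++) {
        /* strchr would match the terminating NUL, so reject it first. */
        const char *pos = in[i] ? strchr(ALPHABET, in[i]) : NULL;
        if (pos == NULL)
            return -1;
        bits = (bits << 6) | (uint32_t)(pos - ALPHABET);
    }
    out[0] = (uint8_t)(bits >> 16);
    out[1] = (uint8_t)(bits >> 8);
    out[2] = (uint8_t)bits;
    return 0;
}

/* Unpack 3 bytes back into 4 characters (no NUL terminator is written). */
void unpack4(const uint8_t in[3], char out[4])
{
    uint32_t bits = ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | in[2];

    for (int i = 3; i >= 0; i--) {
        out[i] = ALPHABET[bits & 0x3F];
        bits >>= 6;
    }
}
```

Strings whose length is not a multiple of 4 would need padding, for example with spaces. Packing 4 characters into 3 bytes shrinks string storage by 25% for text that fits the alphabet.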
We don't need comma, but currently the users db I use has '#', '&', "'" (single quote), '(', ')', '*', '+', '-', '.', '/', ':', ';', '=', '?', '@', ']', '_', '`', '|', and '$'. We could avoid some of these with cleanup, but I think we'll still benefit from having at least space, dash, ampersand, and period. I find it tough to get down to the required 64-character alphabet. One of my (admittedly self-imposed) requirements was that the current database contents be fully supported. I don't plan to add any character string compression, but others are welcome to do so.
No worries. It was just a suggestion, as it does yield about 30% extra compression on the entire uncompressed string for each record. However, it would yield less compression on your shorter substrings. BTW, I initially thought your compression also handled all the complete duplicate records, where people have 2 or 3 IDs with exactly the same information in each apart from the ID. I wonder if you could somehow add that as some sort of special case. There are also a large number of IDs which hardly ever get used. HamDigital.org used to maintain a list of active IDs which could be downloaded with activity range limits of up to 1 year or more. As I recall, only about 50% of the IDs were ever active in any given year. Of course, if DMR MARC supported TA, then none of this would be necessary ;-) And I don't know why no one has written an extension to MMDVMHost to inject TA, because that would fix the problem for the large number of people using hotspots etc. on DMR MARC, and potentially for all DMR MARC repeaters which use MMDVMHost. Unfortunately I don't have time to update MMDVMHost, because I'm busy on loads of other projects.
My method already stores only one record when multiple DMR IDs have the same callsign, name, etc. It's not a special case.
OK, that's good to know.
It looks like (back-of-the-envelope guesstimate) we could save an extra 10% (on a full database containing names, cities, states and countries) by encoding the most frequently occurring character pairs as unused character values. I.e., encode the current characters into values 0 to <number_of_unique_characters>-1 and use values <number_of_unique_characters> to 255 to represent the most frequently occurring character pairs. And the decoding would be quite simple. I think I'll code it up and see what it gives us. If that's implemented it will be independent of the current code, so I would still appreciate someone's careful review of this PR as it currently stands.
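The pair-encoding scheme described here can be sketched as follows. This is not the prototype mentioned in the thread: the tables `SINGLES` and `PAIRS` are tiny hypothetical examples (a real tool would build them from character and pair frequencies in the database), and `pair_encode` and `pair_decode` are illustrative names. The encoder is a simple greedy left-to-right match, which is an assumption and not necessarily optimal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tables: single characters get codes 0..NSINGLES-1,
 * frequent pairs get the next codes, up to 255 at most. */
static const char SINGLES[] = "abcdefghijklmnopqrstuvwxyz ";
static const char *const PAIRS[] = { "an", "er", "on", "in" };

#define NSINGLES (sizeof(SINGLES) - 1)
#define NPAIRS   (sizeof(PAIRS) / sizeof(PAIRS[0]))

/* Encode s into out; returns the number of bytes written,
 * or -1 if s contains a character outside SINGLES. */
int pair_encode(const char *s, uint8_t *out)
{
    int n = 0;

    assert(NSINGLES + NPAIRS <= 256);
    while (*s) {
        size_t p;

        /* Greedy: try to match a known pair at the current position. */
        for (p = 0; p < NPAIRS; p++)
            if (s[0] == PAIRS[p][0] && s[1] == PAIRS[p][1])
                break;

        if (p < NPAIRS) {
            out[n++] = (uint8_t)(NSINGLES + p);
            s += 2;
        } else {
            const char *pos = strchr(SINGLES, *s);
            if (pos == NULL)
                return -1;
            out[n++] = (uint8_t)(pos - SINGLES);
            s += 1;
        }
    }
    return n;
}

/* Decode n bytes into a NUL-terminated string; returns its length.
 * Codes beyond the tables are skipped. */
size_t pair_decode(const uint8_t *in, size_t n, char *out)
{
    size_t len = 0;

    for (size_t i = 0; i < n; i++) {
        size_t code = in[i];
        if (code < NSINGLES) {
            out[len++] = SINGLES[code];
        } else if (code < NSINGLES + NPAIRS) {
            out[len++] = PAIRS[code - NSINGLES][0];
            out[len++] = PAIRS[code - NSINGLES][1];
        }
    }
    out[len] = '\0';
    return len;
}
```

With these example tables, "london" encodes to 4 bytes instead of 6, because "on" occurs twice and each occurrence collapses to one code.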
I wouldn't be using it with MD380 tools, and unfortunately I'm also mega busy with other projects, so this one would not get looked at for several months.
I prototyped the character pair compression. My estimate of 10% savings was way off. It's actually only 4%, and that's measured against the indexed file. The saving relative to the original file is less than 2%. I don't know that it's worth it.