extension of gtoplevelsof #57
Comments
I've had these exact items on my internal TODO list for some time so maybe I ought to get around to them. The However, there are two workarounds for this, as I note on that issue thread. I am very reluctant to force matrices to be large despite the limit, but I'd be very willing to re-write the interaction between mata and C so that it uses temporary files on disk, as opposed to matrices in Stata. This has the added bonus that you would be able to specify Now that I think about it, I have already done this for
You can even save these in mata via
Would it be possible to save them as a tempfile? Those are easy to read from Mata, and it shouldn't affect speed much (the cost of displaying the list on screen is probably orders of magnitude higher than the cost of loading a small csv file). You can then use something like moremata's Another faster but more difficult alternative would be for gtools to be able to save files in Mata matrix format, so they are easier to load.
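To make the tempfile suggestion concrete, here is a minimal sketch of the round trip in Python rather than in C and Mata. It is an illustration only, not how gtools does it: `write_levels` and `read_levels` are hypothetical names, and the two-column CSV layout (level, frequency) is an assumption.

```python
import csv
import os
import tempfile


def write_levels(levels):
    """Write (level, frequency) pairs to a temporary CSV file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as fh:
        writer = csv.writer(fh)
        for level, freq in levels:
            # Every field is written as text, so string and numeric
            # levels share a single on-disk format.
            writer.writerow([level, freq])
    return path


def read_levels(path):
    """Read the pairs back; levels stay strings, frequencies become ints."""
    with open(path, newline="") as fh:
        return [(level, int(freq)) for level, freq in csv.reader(fh)]


path = write_levels([("north", 120), ("south", 75), ("3.5", 2)])
try:
    levels = read_levels(path)
finally:
    os.remove(path)  # clean up the temporary file
print(levels)
```

Because nothing passes through a fixed-size matrix, the number of levels is bounded only by disk space in this sketch.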
@sergiocorreia Yes, it's possible to use tempfiles; you can see here (saved from C here and read back here) that I do some of this already, but I am unfamiliar with "mata format"; do you have a reference? I think there are six items here, but I don't think it will be too hard since
That sounds just like what I'm looking for!
Mata allows you to save matrices to disk, so I thought the specification might be out there in the same way as Stata's dta. However, it seems that the code is closed (see For instance, regarding c), the first part of the
What's the speed gain relative to reading a binary file? If it is small, then maybe it's not worth it (in expectation, most calls will not be for several million levels, where disk writes and reads start to make a really noticeable difference). If it's more than that, then maybe it's worth it.
True. Anything that will be read on screen would be better just saved as CSV. Also, you can do the formatting in C, so instead of having some values as strings and some as numbers, you can have everything as string. (In fact, why save the output as CSV instead of just pure strings that can be printf()ed
Ok, I've mostly figured this out. I've been debating what to do about the variable levels. Since This will save the results in a mata object, with the string and numeric levels in separate matrices. I think I'll add a printf method to print the levels in logical order, but by default I want to preserve the raw data.
This just hit develop. Both
Features
- `greshape` supports `@` syntax for wide and long. Change the string to be matched via `match()`
- `greshape` supports stata varlist syntax for long to wide (may not be combined with `@` within a stub).
- `greshape` does not support varlist syntax for wide to long, but can use `match(regex)` for complex wide to long matches (see examples).
- Closes #57
- `glevelsof, mata[(name)]` saves the levels to mata. The levels are _not_ stored in `r(levels)` and option `local()` is not allowed. With `silent`, the levels are additionally not formatted.
- `glevelsof, mata numfmt()` requires `numfmt` to be a mata print format instead of a C print format.
- `gtop, ntop(.)` and `gtop, ntop(-.)` now allow printing all the levels from largest to smallest or the converse.
- `gtop, alpha` sorts the top levels in variable order. If `gtop -var, alpha` is passed then they are sorted in reverse order.
- `gtop, mata` uses temporary files on disk to read the levels from C via mata. Matrices and locals are not used, meaning `r(levels)`, `r(toplevels)`, and the results stored via the option -matrix()-, ``r(`matrix')``, are no longer available. The user can access each of these via the mata object `GtoolsByLevels` (the user can change the name of this object via `mata(name)`). The levels are stored raw in `GtoolsByLevels.charx` and `GtoolsByLevels.numx`; the levels are stored formatted in `GtoolsByLevels.printed`; the frequencies are stored in `GtoolsByLevels.toplevels`.
- `r(matalevels)` stores the name of the mata object with the levels and frequencies.
- `gtop` also stores `r(ntop)`, `r(nrows)`, and `r(alpha)` as return scalars, for the number of top levels (if `.`, this will be `r(J)`), the number of rows in the `toplevels` matrix (it may or may not include a row for "other" and a row for "missing"), and whether the top levels are sorted by their values.
- `gtop, mata numfmt()` requires `numfmt` to be a mata print format instead of a C print format.
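The ordering options in the notes above can be pictured with a small Python sketch. This is not gtop's code, and the details are assumptions: `top_levels` and its parameters are hypothetical, with `ntop=None` standing in for `ntop(.)`, `ascending=True` for the reversed order of `ntop(-.)`, and `alpha=True` for sorting the kept levels by value. The tie-breaking rule and the label of the "other" row are also invented for the illustration.

```python
from collections import Counter


def top_levels(values, ntop=10, ascending=False, alpha=False):
    """Return (level, frequency) rows, plus an "other" row for dropped levels."""
    counts = Counter(values)
    # Rank by frequency; ties are broken by the level itself.
    ranked = sorted(
        counts.items(),
        key=lambda kv: (kv[1] if ascending else -kv[1], kv[0]),
    )
    # ntop=None keeps every level instead of truncating.
    kept = ranked if ntop is None else ranked[:ntop]
    other = sum(freq for _, freq in ranked[len(kept):])
    if alpha:
        kept = sorted(kept)  # order the kept levels by value, not frequency
    rows = list(kept)
    if other:
        rows.append(("other", other))
    return rows


print(top_levels(list("aaabbc"), ntop=2))
print(top_levels(list("aaabbc"), ntop=None, ascending=True))
```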
Hey -- I've been on vacation and buried in work after vacation, but I took a look at this and it's absolutely great! Thanks so much for these improvements!
I work with a bunch of former and current SAS programmers and have gotten used to using the SAS PROC FREQ for checking our data construction. It's really handy. There is a `tablist` command for Stata, but it's really slow because it amounts to running `preserve`, `contract`, and then `restore`.
I would like to extend the `gtools` package to give output similar to `tablist`. This should be very easy since the heavy lifting could be borrowed from your `gtoplevelsof.ado` command. Consider the following example:
At first glance, it appears two main changes would be required.
`varlist`, rather than truncating at the first 10. I tried just setting `ntop()` to some really big number (e.g., `ntop(1e12)`) but when I do that I get an error about `matsize` not being large enough. I'm not sure why that's a problem since I thought your back end was in Mata.
You could get fancy with other formatting options, but the two edits above are key.
I'd be happy to help. I just need some help sorting through the gtools stuff. My first attempt received errors when it hit `_gtools_internal`.
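For readers who have not used it, the `preserve`, `contract`, `restore` pattern boils down to counting every distinct combination of the listed variables and reporting frequencies and percentages. A minimal Python sketch of that computation follows. It is only an illustration of the requested output, not `tablist` or gtools code: `freq_table` is a hypothetical name, and sorting by descending frequency is an assumption.

```python
from collections import Counter


def freq_table(rows):
    """Return (combo, freq, percent, cum_percent) for every distinct combination."""
    counts = Counter(tuple(r) for r in rows)
    total = sum(counts.values())
    # Sort by descending frequency, then by value so ties print in a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    table, running = [], 0
    for combo, freq in ordered:
        running += freq
        table.append((combo, freq, 100.0 * freq / total, 100.0 * running / total))
    return table


rows = [("a", 1), ("b", 2), ("a", 1), ("a", 2), ("b", 2), ("a", 1)]
for combo, freq, pct, cum in freq_table(rows):
    print(combo, freq, f"{pct:.1f}", f"{cum:.1f}")
```

Every combination is listed, with no truncation at a fixed number of levels, which is the behavior the request is after.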