extension of gtoplevelsof #57

Closed
kkranker opened this issue Mar 19, 2019 · 10 comments

@kkranker

kkranker commented Mar 19, 2019

I work with a bunch of former and current SAS programmers and have gotten used to using SAS's PROC FREQ to check our data construction. It's really handy. There is a tablist command for Stata, but it's really slow because it amounts to running preserve, contract, and then restore.
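
For reference, here is a minimal sketch of the preserve/contract/restore pattern described above. It is not tablist's actual source; the new variable names (N, Cum, Pct, CumPct) are made up for illustration, and only built-in Stata commands are used.

```stata
* Illustrative only: one row per combination of the variables, then list it.
sysuse auto, clear
preserve
    * contract replaces the data with one row per rep78 x foreign combination
    contract rep78 foreign, freq(N) cfreq(Cum) percent(Pct) cpercent(CumPct)
    * contract leaves the result sorted by the variable list
    list, noobs sepby(rep78)
    * re-sort by descending frequency (cumulative columns omitted here,
    * since they were computed in the original order)
    gsort -N rep78 foreign
    list rep78 foreign N Pct, noobs
restore
```

On large data sets the preserve and restore steps copy the full data, which is where most of the time goes.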

I would like to extend the gtools package to produce output similar to tablist's. This should be very easy, since the heavy lifting could be borrowed from your gtoplevelsof.ado command. Consider the following example:

. sysuse auto
(1978 Automobile Data)

. gtoplevelsof rep78 foreign make

  rep78    foreign                make |    N  Cum   Pct (%)   Cum Pct (%) 
 --------------------------------------------------------------------------
      1   Domestic       Olds Starfire |    1    1       1.4           1.4 
      1   Domestic      Pont. Firebird |    1    2       1.4           2.7 
      2   Domestic       Cad. Eldorado |    1    3       1.4           4.1 
      2   Domestic   Chev. Monte Carlo |    1    4       1.4           5.4 
      2   Domestic         Chev. Monza |    1    5       1.4           6.8 
      2   Domestic      Dodge Diplomat |    1    6       1.4           8.1 
      2   Domestic        Dodge Magnum |    1    7       1.4           9.5 
      2   Domestic     Dodge St. Regis |    1    8       1.4          10.8 
      2   Domestic        Plym. Volare |    1    9       1.4          12.2 
      2   Domestic       Pont. Sunbird |    1   10       1.4          13.5 
 --------------------------------------------------------------------------
                                 Other |   64   74      86.5         100.0 

At first glance, it appears two main changes would be required.

  1. Default to showing all combinations of varlist, rather than truncating at the first 10. I tried just setting ntop() to some really big number (e.g., ntop(1e12)) but when I do that I get an error about matsize not being large enough. I'm not sure why that's a problem since I thought your back end was in Mata.
  2. Provide options for sorting the table by frequency or sorting by `varlist'.

You could get fancy with other formatting options, but the two edits above are key.

I'd be happy to help. I just need some help sorting through the gtools stuff. My first attempt received errors when it hit _gtools_internal.

@mcaceresb mcaceresb self-assigned this Mar 19, 2019
@mcaceresb
Owner

I've had these exact items on my internal TODO list for some time so maybe I ought to get around to them. The matsize requirement is explained in this issue and it has to do with how Stata interacts with C (i.e. the plugin backend, rather than the mata backend, because mata cannot interact with C directly).

However, there are two workarounds for this, as I note on that issue thread. I am very reluctant to force matrices to be large despite the limit, but I'd be very willing to re-write the interaction between mata and C so that it uses temporary files on disk, as opposed to matrices in Stata. This has the added bonus that you would be able to specify ntop(.) as short-hand for "all the groups" since saving files on disk would not require advance knowledge of the number of groups.

Now that I think about it, I have already done this for gstats tab so maybe I can just re-purpose that code.

sysuse auto
gen byte ones =  1
gstats tab ones, by(rep78 foreign) s(count percent) pretty

You can even save these in mata via matasave, but I think that's not as clean as just using gtop directly. I'm mostly noting this for my own benefit: code that does this already exists in the code base.

@sergiocorreia

Would it be possible to save them as a tempfile? Those are easy to read from Mata, and it shouldn't affect speed much (the cost of displaying the list on screen is probably orders of magnitude higher than the cost of loading a small csv file). You can then use something like moremata's mm_matlist to display the tabulations.

Another faster but more difficult alternative would be for gtools to be able to save files in Mata matrix format, so they are easier to load.

@mcaceresb
Owner

mcaceresb commented Mar 19, 2019

@sergiocorreia Yes, it's possible to use tempfiles; you can see here (saved from C here and read back here) that I do some of this already, but I am unfamiliar with "mata format"; do you have a reference?

I think there are six items here, but I don't think it will be too hard since gstats tab does a lot of this already.

  • Dispense with Stata matrices for gtop. Save levels and frequencies as tempfiles.
  • Tempfiles are either in binary format or "mata format".
  • Add option ntop(.) to keep all the levels.
  • Add option alpha to list the top levels in alphabetical order.
  • Add option silent to prevent gtop from printing the output.
  • Add option matasave to save the levels and output as mata objects instead of r()

@kkranker
Author

That sounds just like what I'm looking for!

@sergiocorreia

> I am unfamiliar with "mata format"; do you have a reference?

Mata allows you to save matrices to disk, so I thought the specification might be out there in the same way as Stata's dta.

However, it seems that the code is closed (see help mata matsave and help mata fopen). Thus, we can either a) forget about it, b) ask them if they want to share their spec (AFAIK, it's not that complex), or c) reverse engineer it (if we really want it).

For instance, regarding c), the first part of the mmat files is pretty deterministic, because it's described in view mmat_.mata, adopath asis. For the rest, you can quickly see the layout if you play with matsave-ing a bunch of matrices. That said, if we ask them, I hope/suspect they would be open to sharing it.

@mcaceresb
Owner

What's the speed gain relative to reading a binary file? If it's small, then maybe it's not worth it (in expectation, most calls will not be for several million levels, where disk writes and reads start to make a really noticeable difference). If it's more than that, then maybe it's worth it.

@sergiocorreia

True. Anything that will be read on screen would be better just saved as CSV. Also, you can do the formatting in C, so instead of having some values as strings and some as numbers, you can have everything as string.

(In fact, why save the output as CSV instead of just pure strings that can be printf()ed?)

@mcaceresb
Owner

Ok, I've mostly figured this out.

I've been debating what to do about the variable levels. Since gtop was originally an extension of glevelsof, the levels are saved and parsed as a Stata macro. I think I will leave the default behavior as-is, and for a large number of levels the user can specify matasave.

This will save the results in a mata object, with the string and numeric levels in separate matrices. I think I'll add a printf method to print the levels in logical order, but by default I want to preserve the raw data.

@mcaceresb
Owner

This just hit develop. Run gtools, upgrade branch(develop) to try it out.

Both gtop and glevelsof accept mata and mata(name) as options to save the levels in mata.
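
A minimal sketch of how the new options might be exercised, assuming the develop-branch version of gtools is installed. The option names and the fields of the GtoolsByLevels object are taken from the commit notes in this thread; exactly how the options combine (for example ntop(.) together with alpha) is an assumption, not verified output.

```stata
* Assumes gtools from the develop branch, as described in this thread.
sysuse auto, clear

* All levels, from largest to smallest frequency
gtoplevelsof rep78 foreign, ntop(.)

* All levels, sorted by the variables rather than by frequency
gtoplevelsof rep78 foreign, ntop(.) alpha

* Save the results to a mata object instead of r(); default name GtoolsByLevels
gtoplevelsof rep78 foreign, ntop(.) mata

* Frequencies, per the commit notes
mata: GtoolsByLevels.toplevels

* Formatted levels, per the commit notes
mata: GtoolsByLevels.printed
```

Passing mata(name) instead of mata should store the same results under a name of your choosing.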

mcaceresb added a commit that referenced this issue Mar 24, 2019
Features

- `greshape` supports `@` syntax for wide and long. Change the string
  to be matched via `match()`

- `greshape` supports stata varlist syntax for long to wide (may not be
  combined with `@` within a stub).

- `greshape` does not support varlist syntax for wide to long, but can
  use `match(regex)` for complex wide to long matches (see examples).

- Closes #57

- `glevelsof, mata[(name)]` saves the levels to mata. The levels are _not_
  stored in `r(levels)` and option `local()` is not allowed. With `silent`,
  the levels are additionally not formatted.

- `glevelsof, mata numfmt()` requires `numfmt` to be a mata print format
  instead of a C print format.

- `gtop, ntop(.)` and `gtop, ntop(-.)` now allow printing all the levels
  from largest to smallest or the converse.

- `gtop, alpha` sorts the top levels in variable order. If `gtop -var, alpha`
  is passed then they are sorted in reverse order.

- `gtop, mata` uses temporary files on disk to read the levels from C
  via mata. Matrices and locals are not used, meaning `r(levels)`,
  `r(toplevels)`, and the results stored via the option -matrix()-,
  ``r(`matrix')``, are no longer available. The user can access each
  of these via the mata object `GtoolsByLevels` (the user can change
  the name of this object via `mata(name)`). The levels are stored raw
  in `GtoolsByLevels.charx` and `GtoolsByLevels.numx`; the levels are
  stored formatted in `GtoolsByLevels.printed`; the frequencies are
  stored in `GtoolsByLevels.toplevels`.

- `r(matalevels)` stores the name of the mata object with the levels and frequencies.

- `gtop` also stores `r(ntop)`, `r(nrows)`, and `r(alpha)` as return scalars,
  for the number of top levels (if `.`, this will be `r(J)`), the number of
  rows in the `toplevels` matrix (it may or may not include a row for "other" and
  a row for "missing"), and whether the top levels are sorted by their values.

- `gtop, mata numfmt()` requires `numfmt` to be a mata print format instead of
  a C print format.
@kkranker
Author

kkranker commented May 7, 2019

Hey -- I've been on vacation and buried in work after vacation, but I took a look at this and it's absolutely great! Thanks so much for these improvements!
