Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #304 from nadiaenh/glm
Add GLM with design-based standard errors
Showing 20 changed files with 716 additions and 162 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,99 @@ | ||
# Generalized Linear Models in Survey | ||
The `glm()` function in the Julia Survey package is used to fit generalized linear models (GLMs) to survey data. It incorporates survey design information, such as sampling weights, stratification, and clustering, to produce valid estimates and standard errors that account for the type of survey design. | ||
|
||
As of June 2023, the [GLM.jl documentation](https://juliastats.org/GLM.jl/stable/) lists the supported distribution families and their link functions as: | ||
```txt | ||
Bernoulli (LogitLink) | ||
Binomial (LogitLink) | ||
Gamma (InverseLink) | ||
Geometric (LogLink) | ||
InverseGaussian (InverseSquareLink) | ||
NegativeBinomial (NegativeBinomialLink, often used with LogLink) | ||
Normal (IdentityLink) | ||
Poisson (LogLink) | ||
``` | ||
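Each link function relates the linear predictor η to the mean μ of the response. As a rough illustration only, the inverse links for several of the families above can be written in plain Julia. The function names here (`inv_logit`, `inv_log`, and so on) are made up for this sketch and are not part of GLM.jl or Survey.jl:

```julia
# Inverse link functions: map the linear predictor η to the mean μ.
# Plain-Julia illustrations with hypothetical names, not the GLM.jl implementations.
inv_identity(η) = η                  # IdentityLink (Normal): μ = η
inv_logit(η) = 1 / (1 + exp(-η))     # LogitLink (Bernoulli, Binomial): μ in (0, 1)
inv_log(η) = exp(η)                  # LogLink (Poisson): μ > 0
inv_inverse(η) = 1 / η               # InverseLink (Gamma): η = 1/μ
inv_inverse_square(η) = 1 / sqrt(η)  # InverseSquareLink (InverseGaussian): η = 1/μ^2

# A linear predictor of 0 corresponds to a probability of 0.5 under the logit link.
μ = inv_logit(0.0)
```

The choice of link determines the range the fitted mean can take, which is why a binary outcome is paired with the logit link and a count outcome with the log link.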
|
||
Refer to the GLM.jl documentation for more information about the GLM package. | ||
|
||
## Fitting a GLM to a Survey Design object | ||
|
||
You can fit a GLM to a Survey Design object the same way you would fit it to a regular data frame. The only difference is that you need to specify the survey design object as the second argument to the `glm()` function. | ||
|
||
```julia | ||
using Survey | ||
apisrs = load_data("apisrs") | ||
apistrat = load_data("apistrat") | ||
apiclus1 = load_data("apiclus1") | ||
|
||
# Simple random sample survey | ||
srs = SurveyDesign(apisrs, weights = :pw) | ||
|
||
# Survey stratified by stype | ||
dstrat = SurveyDesign(apistrat, strata = :stype, weights = :pw) | ||
|
||
# Survey clustered by dnum | ||
dclus1 = SurveyDesign(apiclus1, clusters = :dnum, weights = :pw) | ||
``` | ||
|
||
Once you have the survey design object, you can fit a GLM using the `glm()` function. Specify the formula for the model and the distribution family. | ||
|
||
The `glm()` function supports all distribution families supported by GLM.jl, i.e. Bernoulli, Binomial, Gamma, Geometric, InverseGaussian, NegativeBinomial, Normal, and Poisson. | ||
|
||
For example, to fit a GLM with a Normal distribution to the `srs` survey design object we created above: | ||
```julia | ||
formula = @formula(api00 ~ api99) | ||
my_glm = glm(formula, srs, family = Normal()) | ||
|
||
# View the coefficients and standard errors | ||
my_glm.Coefficients | ||
my_glm.SE | ||
``` | ||
|
||
## Examples | ||
|
||
The examples below use the `api` datasets, which contain survey data collected about California schools. The datasets are included in the Survey.jl package and can be loaded by calling `load_data("name_of_dataset")`. | ||
|
||
### Bernoulli with Logit Link | ||
|
||
A school being eligible for the awards program (`awards`) is a binary outcome (0 or 1). Let's assume it follows a Bernoulli distribution. Suppose we want to predict `awards` based on the percentage of students eligible for subsidized meals (`meals`) and the percentage of English Language Learners (`ell`). We can fit this GLM using the code below: | ||
|
||
```julia | ||
using Survey | ||
apisrs = load_data("apisrs") | ||
|
||
# Convert yes/no to 1/0 before building the design | ||
apisrs.awards = ifelse.(apisrs.awards .== "Yes", 1, 0) | ||
srs = SurveyDesign(apisrs, weights = :pw) | ||
|
||
# Fit the model on the survey design object | ||
model = glm(@formula(awards ~ meals + ell), srs, Bernoulli(), LogitLink()) | ||
``` | ||
|
||
### Poisson with Log Link | ||
|
||
Let us assume that the number of students tested (`api_stu`) follows a Poisson distribution, which models non-negative integer counts of events. Suppose we want to predict the number of students tested based on the percentage of students eligible for subsidized meals (`meals`) and the percentage of English Language Learners (`ell`). We can fit this GLM using the code below: | ||
|
||
```julia | ||
using Survey, DataFrames | ||
apisrs = load_data("apisrs") | ||
|
||
# Rename api.stu to api_stu before building the design | ||
rename!(apisrs, Symbol("api.stu") => :api_stu) | ||
srs = SurveyDesign(apisrs, weights = :pw) | ||
|
||
# Fit the model on the survey design object | ||
model = glm(@formula(api_stu ~ meals + ell), srs, Poisson(), LogLink()) | ||
``` | ||
|
||
### Gamma with Inverse Link | ||
|
||
Let us assume that the average parental education level (`avg_ed`) follows a Gamma distribution, which is suitable for modeling continuous, positive-valued variables with a skewed distribution. Suppose we want to predict the average parental education level based on the percentage of students eligible for subsidized meals (`meals`) and the percentage of English Language Learners (`ell`). We can fit this GLM using the code below: | ||
|
||
```julia | ||
using Survey, DataFrames | ||
apisrs = load_data("apisrs") | ||
|
||
# Rename avg.ed to avg_ed before building the design | ||
rename!(apisrs, Symbol("avg.ed") => :avg_ed) | ||
srs = SurveyDesign(apisrs, weights = :pw) | ||
|
||
# Fit the model on the survey design object | ||
model = glm(@formula(avg_ed ~ meals + ell), srs, Gamma(), InverseLink()) | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,28 +1,23 @@ | ||
function bydomain(x::Symbol, domain, design::SurveyDesign, func::Function) | ||
gdf = groupby(design.data, domain) | ||
X = combine(gdf, [x, design.weights] => ((a, b) -> func(a, weights(b))) => :statistic) | ||
return X | ||
function subset(group, design::SurveyDesign) | ||
return SurveyDesign(DataFrame(group);clusters = design.cluster, strata = design.strata, popsize = design.popsize, weights = design.weights) | ||
end | ||
|
||
function subset(group, design::ReplicateDesign) | ||
return ReplicateDesign{typeof(design.inference_method)}(DataFrame(group), design.replicate_weights;clusters = design.cluster, strata = design.strata, popsize = design.popsize, weights = design.weights) | ||
end | ||
|
||
function bydomain(x::Symbol, domain, design::ReplicateDesign, func::Function) | ||
function bydomain(x::Union{Symbol, Vector{Symbol}}, domain,design::Union{SurveyDesign, ReplicateDesign}, func::Function, args...; kwargs...) | ||
domain_names = unique(design.data[!, domain]) | ||
gdf = groupby(design.data, domain) | ||
nd = length(gdf) | ||
X = combine(gdf, [x, design.weights] => ((a, b) -> func(a, weights(b))) => :statistic) | ||
Xt_mat = Array{Float64,2}(undef, (nd, design.replicates)) | ||
for i = 1:design.replicates | ||
Xt_mat[:, i] = | ||
combine( | ||
gdf, | ||
[x, Symbol("replicate_" * string(i))] => | ||
((a, c) -> func(a, weights(c))) => :statistic, | ||
).statistic | ||
domain_names = [join(collect(keys(gdf)[i]), "-") for i in 1:length(gdf)] | ||
vars = DataFrame[] | ||
for group in gdf | ||
push!(vars, func(x, subset(group, design), args...; kwargs...)) | ||
end | ||
ses = Float64[] | ||
for i = 1:nd | ||
filtered_dx = filter(!isnan, Xt_mat[i, :] .- X.statistic[i]) | ||
push!(ses, sqrt(sum(filtered_dx .^ 2) / length(filtered_dx))) | ||
estimates = vcat(vars...) | ||
if isa(domain, Vector{Symbol}) | ||
domain = join(domain, "_") | ||
end | ||
replace!(ses, NaN => 0) | ||
X.SE = ses | ||
return X | ||
end | ||
estimates[!, domain] = domain_names | ||
return estimates | ||
end |
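The refactored `bydomain` follows a split-apply-combine pattern: group the data by domain, apply the estimator to each subgroup, and stack the results with their domain labels. A minimal, dependency-free sketch of that pattern is shown here. It uses plain vectors rather than a `SurveyDesign`, and the names `weighted_mean` and `by_domain` are hypothetical, chosen only for this illustration:

```julia
# Hypothetical stand-in for an estimator such as a design-weighted mean.
weighted_mean(x, w) = sum(x .* w) / sum(w)

# Split rows by domain label, apply `func` to each subgroup, collect the results.
function by_domain(x, w, domain, func)
    results = Pair{eltype(domain),Float64}[]
    for d in unique(domain)
        idx = findall(==(d), domain)          # rows belonging to this domain
        push!(results, d => func(x[idx], w[idx]))
    end
    return results
end

x = [10.0, 20.0, 30.0, 40.0]       # observed values
w = [1.0, 1.0, 1.0, 3.0]           # sampling weights
domain = ["E", "E", "H", "H"]      # domain label for each row
estimates = by_domain(x, w, domain, weighted_mean)
```

Delegating to `func` on each subgroup means any estimator that accepts a design can be reused per domain without a separate replicate-weight loop in `bydomain` itself.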