Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

termnames #299

Merged
merged 25 commits into from
Sep 6, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion Project.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
name = "StatsModels"
uuid = "3eaba693-59b7-5ba5-a881-562e759f1c8d"
version = "0.7.2"
version = "0.7.3"

[deps]
DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"
Expand All @@ -10,6 +10,7 @@ Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
REPL = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"
ShiftedArrays = "1277b4bf-5013-50f5-be3d-901d8477a67a"
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
StatsAPI = "82ae8749-77ed-4fe6-ae5f-f523153014b0"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
StatsFuns = "4c63d2b9-4356-54db-8cca-17b64c39e42c"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
Expand All @@ -21,6 +22,7 @@ DataAPI = "1.1"
DataFrames = "1"
DataStructures = "0.17, 0.18"
ShiftedArrays = "1, 2"
StatsAPI = "1"
StatsBase = "0.33.5, 0.34"
StatsFuns = "0.9, 1.0"
Tables = "0.2, 1"
Expand Down
1 change: 1 addition & 0 deletions docs/Project.toml
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
GLM = "38e38edf-8417-5370-95a0-9cbb8c7f171a"
StableRNGs = "860ef19b-820b-49d6-a774-d7a799459cd3"
StatsAPI = "82ae8749-77ed-4fe6-ae5f-f523153014b0"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
StatsModels = "3eaba693-59b7-5ba5-a881-562e759f1c8d"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
Expand Down
1 change: 1 addition & 0 deletions docs/src/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ end
term
coefnames
modelcols
termnames
```

### Higher-order terms
Expand Down
10 changes: 5 additions & 5 deletions docs/src/internals.md
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ FormulaTerm{Term, Term}
```

!!! note

As always, you can introspect which method is called with

```julia
Expand Down Expand Up @@ -395,7 +395,7 @@ possible to use an existing function, the best practice is to define a new
function to make dispatch less ambiguous.

```jldoctest 1
using StatsBase
using StatsAPI
# syntax: best practice to define a _new_ function
poly(x, n) = x^n

Expand Down Expand Up @@ -444,7 +444,7 @@ StatsModels.termvars(p::PolyTerm) = StatsModels.termvars(p.term)
# number of columns in the matrix this term produces
StatsModels.width(p::PolyTerm) = p.deg

StatsBase.coefnames(p::PolyTerm) = coefnames(p.term) .* "^" .* string.(1:p.deg)
StatsAPI.coefnames(p::PolyTerm) = coefnames(p.term) .* "^" .* string.(1:p.deg)

# output

Expand Down Expand Up @@ -558,9 +558,9 @@ PolyTerm{Term, ConstantTerm{Int64}}
```

!!! note

The functions like `poly` should be exported by the package that provides
the special syntax for two reasons. First, it makes run-time term
the special syntax for two reasons. First, it makes run-time term
construction more convenient. Second, because of how the `@formula` macro
generates code, the function that represents special syntax must be
available in the namespace where `@formula` is _called_. This is because
Expand Down
6 changes: 5 additions & 1 deletion src/StatsModels.jl
Original file line number Diff line number Diff line change
@@ -1,13 +1,15 @@
module StatsModels

using Tables
using StatsAPI
using StatsBase
using ShiftedArrays
using ShiftedArrays: lag, lead
using DataStructures
using DataAPI
using DataAPI: levels
using Printf: @sprintf
using StatsAPI: coefnames, fit, predict, dof
using StatsFuns: chisqccdf

using SparseArrays
Expand All @@ -32,10 +34,11 @@ export
HelmertCoding,
SeqDiffCoding,
HypothesisCoding,

coefnames,
setcontrasts!,
formula,
termnames,

AbstractTerm,
ConstantTerm,
Expand Down Expand Up @@ -81,5 +84,6 @@ include("formula.jl")
include("modelframe.jl")
include("statsmodel.jl")
include("lrtest.jl")
include("deprecated.jl")

end # module StatsModels
50 changes: 26 additions & 24 deletions src/contrasts.jl
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ C(levels = ::Vector{Any}, base = ::Any) # specify levels and base
mean of the lower levels
* [`SeqDiffCoding`](@ref) - Code for differences between sequential levels of
the variable.
* [`HypothesisCoding`](@ref) - Manually specify contrasts via a hypothesis
* [`HypothesisCoding`](@ref) - Manually specify contrasts via a hypothesis
matrix, which gives the weighting for the average response for each level
* [`StatsModels.ContrastsCoding`](@ref) - Manually specify contrasts matrix,
which is directly copied into the model matrix.
Expand All @@ -79,15 +79,15 @@ The easiest way to specify custom contrasts is with `HypothesisCoding` or
contrast coding system, you can subtype `AbstractContrasts`. This requires a
constructor, a `contrasts_matrix` method for constructing the actual contrasts
matrix that maps from levels to `ModelMatrix` column values, and (optionally) a
`termnames` method:
`coefnames` method:

```julia
mutable struct MyCoding <: AbstractContrasts
...
end

contrasts_matrix(C::MyCoding, baseind, n) = ...
termnames(C::MyCoding, levels, baseind) = ...
coefnames(C::MyCoding, levels, baseind) = ...
```

# References
Expand All @@ -103,30 +103,32 @@ abstract type AbstractContrasts end
# Contrasts + Levels (usually from data) = ContrastsMatrix
struct ContrastsMatrix{C <: AbstractContrasts, M <: AbstractMatrix, T, U}
matrix::M
termnames::Vector{U}
coefnames::Vector{U}
levels::Vector{T}
contrasts::C
invindex::Dict{T,Int}
function ContrastsMatrix(matrix::M,
termnames::Vector{U},
coefnames::Vector{U},
levels::Vector{T},
contrasts::C) where {U, T, C <: AbstractContrasts, M <: AbstractMatrix}
allunique(levels) || throw(ArgumentError("levels must be all unique, got $(levels)"))
invindex = Dict{T,Int}(x=>i for (i,x) in enumerate(levels))
new{C,M,T,U}(matrix, termnames, levels, contrasts, invindex)
new{C,M,T,U}(matrix, coefnames, levels, contrasts, invindex)
end
end

# only check equality of matrix, termnames, and levels, and that the type is the
StatsAPI.coefnames(cm::ContrastsMatrix) = cm.coefnames

# only check equality of matrix, coefnames, and levels, and that the type is the
# same for the contrasts (values are irrelevant). This ensures that the two
# will behave identically in creating modelmatrix columns
Base.:(==)(a::ContrastsMatrix{C}, b::ContrastsMatrix{C}) where {C<:AbstractContrasts} =
a.matrix == b.matrix &&
a.termnames == b.termnames &&
a.coefnames == b.coefnames &&
a.levels == b.levels

Base.hash(a::ContrastsMatrix{C}, h::UInt) where {C} =
hash(C, hash(a.matrix, hash(a.termnames, hash(a.levels, h))))
hash(C, hash(a.matrix, hash(a.coefnames, hash(a.levels, h))))

"""
An instantiation of a contrast coding system for particular levels
Expand Down Expand Up @@ -166,7 +168,7 @@ function ContrastsMatrix(contrasts::C, levels::AbstractVector{T}) where {C<:Abst
# 3. contrast levels missing from data: would have empty columns, generate a
# rank-deficient model matrix.
c_levels = something(DataAPI.levels(contrasts), levels)

mismatched_levels = symdiff(c_levels, levels)
if !isempty(mismatched_levels)
throw(ArgumentError("contrasts levels not found in data or vice-versa: " *
Expand Down Expand Up @@ -198,7 +200,7 @@ function ContrastsMatrix(contrasts::C, levels::AbstractVector{T}) where {C<:Abst
"$c_levels."))
end

tnames = termnames(contrasts, c_levels, baseind)
tnames = coefnames(contrasts, c_levels, baseind)

mat = contrasts_matrix(contrasts, baseind, n)

Expand All @@ -224,7 +226,7 @@ function ContrastsMatrix(c::ContrastsMatrix, levels::AbstractVector)
return c
end

function termnames(C::AbstractContrasts, levels::AbstractVector, baseind::Integer)
function StatsAPI.coefnames(C::AbstractContrasts, levels::AbstractVector, baseind::Integer)
not_base = [1:(baseind-1); (baseind+1):length(levels)]
levels[not_base]
end
Expand All @@ -233,7 +235,7 @@ Base.getindex(contrasts::ContrastsMatrix, rowinds, colinds) =
getindex(contrasts.matrix, getindex.(Ref(contrasts.invindex), rowinds), colinds)

# Making a contrast type T only requires that there be a method for
# contrasts_matrix(T, baseind, n) and optionally termnames(T, levels, baseind)
# contrasts_matrix(T, baseind, n) and optionally coefnames(T, levels, baseind)
# The rest is boilerplate.
for contrastType in [:DummyCoding, :EffectsCoding, :HelmertCoding]
@eval begin
Expand All @@ -254,7 +256,7 @@ DataAPI.levels(c::AbstractContrasts) = nothing
FullDummyCoding()

Full-rank dummy coding generates one indicator (1 or 0) column for each level,
**including** the base level. This is sometimes known as
**including** the base level. This is sometimes known as
[one-hot encoding](https://en.wikipedia.org/wiki/One-hot).

Not exported but included here for the sake of completeness.
Expand Down Expand Up @@ -331,7 +333,7 @@ column is generated with 1 where `variable .== x` and -1 where `variable .== bas
of 0.

If `levels` are omitted or `nothing`, they are determined from the data
by calling the `levels` function when constructing `ContrastsMatrix`.
by calling the `levels` function when constructing `ContrastsMatrix`.
If `base` is omitted or `nothing`, the first level is used as the base.

When all levels are equally frequent, effects coding generates model matrix
Expand Down Expand Up @@ -373,7 +375,7 @@ Helmert coding codes each level as the difference from the average of the lower
levels.

If `levels` are omitted or `nothing`, they are determined from the data
by calling the `levels` function when constructing `Contrastsmatrix`.
by calling the `levels` function when constructing `Contrastsmatrix`.
If `base` is omitted or `nothing`, the first level is used as the base.
For each non-base level, Helmert coding generates a columns with -1 for each of
n levels below, n for that level, and 0 above.
Expand Down Expand Up @@ -462,7 +464,7 @@ function contrasts_matrix(C::SeqDiffCoding, _, n)
end

# TODO: consider customizing term names:
# termnames(C::SeqDiffCoding, levels::AbstractVector, baseind::Integer) =
# StatsAPI.coefnames(C::SeqDiffCoding, levels::AbstractVector, baseind::Integer) =
# ["$(levels[i])-$(levels[i-1])" for i in 2:length(levels)]

"""
Expand Down Expand Up @@ -591,7 +593,7 @@ function contrasts_matrix(C::HypothesisCoding, baseind, n)
C.contrasts
end

termnames(C::HypothesisCoding, levels::AbstractVector, baseind::Int) =
StatsAPI.coefnames(C::HypothesisCoding, levels::AbstractVector, baseind::Int) =
something(C.labels, levels[1:length(levels) .!= baseind])

DataAPI.levels(c::HypothesisCoding) = c.levels
Expand All @@ -602,8 +604,8 @@ DataAPI.levels(c::HypothesisCoding) = c.levels

Coding by manual specification of contrasts matrix. For k levels, the contrasts
must be a k by k-1 Matrix. The contrasts in this matrix will be copied directly
into the model matrix; if you want to specify your contrasts as hypotheses (i.e.,
weights assigned to each level's cell mean), you should use
into the model matrix; if you want to specify your contrasts as hypotheses (i.e.,
weights assigned to each level's cell mean), you should use
[`HypothesisCoding`](@ref) instead.
"""
mutable struct ContrastsCoding{T<:AbstractMatrix} <: AbstractContrasts
Expand Down Expand Up @@ -687,9 +689,9 @@ julia> StatsModels.hypothesis_matrix(cmat)
-1 0 0 1
```

For non-centered contrasts like `DummyCoding`, without including the intercept
the hypothesis matrix is incorrect. So while `intercept=true` is the default for
non-centered contrasts, you can see the (wrong) hypothesis matrix when ignoring
For non-centered contrasts like `DummyCoding`, without including the intercept
the hypothesis matrix is incorrect. So while `intercept=true` is the default for
non-centered contrasts, you can see the (wrong) hypothesis matrix when ignoring
it by forcing `intercept=false`:

```jldoctest hypmat
Expand All @@ -710,7 +712,7 @@ julia> StatsModels.hypothesis_matrix(cmat, tolerance=0) # ugly
1.0 -2.23753e-16 6.91749e-18 -1.31485e-16
-1.0 1.0 -2.42066e-16 9.93754e-17
-1.0 4.94472e-17 1.0 9.93754e-17
-1.0 1.04958e-16 -1.31044e-16 1.0
-1.0 1.04958e-16 -1.31044e-16 1.0
```

Finally, the hypothesis matrix for a constructed `ContrastsMatrix` (as stored by
Expand Down
13 changes: 13 additions & 0 deletions src/deprecated.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
@deprecate(termnames(C::AbstractContrasts, levels::AbstractVector, baseind::Integer),
coefnames(C::AbstractContrasts, levels::AbstractVector, baseind::Integer),
false)

function Base.getproperty(cm::ContrastsMatrix, x::Symbol)
if x === :termnames
Base.depwarn("The `termnames` field of `ConstrastsMatrix` is deprecated; use `coefnames(cm)` instead.",
:ContrastsMatrix)
return coefnames(cm)
else
return getfield(cm, x)
end
end
Loading
Loading