How to use soc.ca
This page is outdated.
Here we give an example of how to use the soc.ca package to do a specific correspondence analysis. We assume that you already have the soc.ca package loaded from the source files and that the data is ready for analysis. Text written in italics describes possible errors, bugs or missing features that you might need to be aware of.
The data used in this analysis covers the top 100 CEOs in Denmark. Much of the data is in Danish.
RStudio gets our warmest recommendations. It is a very friendly environment for learning R.
Even though we pride ourselves on our plots, they almost always require some editing. Inkscape is perfect for this and many other things.
The code for this example is in wiki_example.R. Not all of the code in that file is used in the example below.
First we create an object with our data from a .csv file. Any other method will do. The data should be a matrix or data.frame with individuals as rows and variables as columns.
data <- read.csv(file="https://raw.github.com/Rsoc/soc.ca/master/wiki_data.csv", sep=";", encoding="UTF-8")
attach(data)
From this data object we are going to create three different objects: the active variables active, the supplementary variables sup and the identifier id. There are different requirements for each set of variables.
The active and the supplementary variables have to be factors and must have no empty levels. Use str() to check that the variables in active and sup are factors. Both sets are organized as data.frames.
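The factor requirement can be checked and repaired with base R alone. The following is a minimal sketch on a made-up data.frame (the object toy and its columns are hypothetical stand-ins for active and sup), not the package's own validation:

```r
# Toy stand-in for the `active` data.frame; the columns and values are made up.
toy <- data.frame(
  mba    = c("Yes", "No", "No", "Yes"),
  sector = factor(c("Bank", "Bank", "Industry", "Bank"),
                  levels = c("Bank", "Industry", "Shipping")),
  stringsAsFactors = FALSE
)

# Which columns are already factors? Here: mba is not, sector is.
is_fac <- sapply(toy, is.factor)

# Convert every column to a factor, then drop levels that no row uses
# ("Shipping" is declared as a level of sector but never observed).
toy[] <- lapply(toy, as.factor)
toy   <- droplevels(toy)

# Number of empty levels left in each column; all should be 0.
empty_levels <- sapply(toy, function(x) sum(table(x) == 0))
```

Running the same two steps on active and sup before the analysis should leave both sets in the required form.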
The identifier is preferably a vector with a unique value for each row in the data. anyDuplicated() returns 0 when every id is unique and otherwise the position of the first duplicate, and duplicated() shows which ids are duplicates. The identifier is used to create the cloud of individuals and can be left out.
active <- data.frame(careerprofile_maclean_cat, careerfoundation_maclean_cat,
years_between_edu_dir_cat, time_in_corp_before_ceo_cat,
age_as_ceo_cat, career_changes_cat2, mba, abroad, hd, phd,
education, author, placeofbirth, familyclass_bourdieu,
partnersfamily_in_whoswho, family_in_whoswho)
sup <- data.frame(size_prestige, ownership_cat_2, sector, location)
id <- navn
If your data has no valid identifier, then it can be constructed in the following way:
id <- 1:nrow(data)
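Before running the analysis it can be worth checking an existing identifier for duplicates. This is a small base R sketch; the names in id_toy are made up, and using make.unique() to repair duplicates is our suggestion rather than something the package requires:

```r
# Made-up identifier with one repeated name.
id_toy <- c("Jensen", "Hansen", "Jensen", "Nielsen")

first_dup <- anyDuplicated(id_toy)          # position of the first duplicate, 0 if none
n_dup     <- sum(duplicated(id_toy))        # how many entries repeat an earlier one
which_dup <- id_toy[duplicated(id_toy)]     # the repeated values themselves

# One way to repair the identifier: append a suffix to repeated values.
id_unique <- make.unique(id_toy)
```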
The specific correspondence analysis sets certain modalities as passive. You define which modalities are set as passive with the set.passive() function. This function changes the default text strings used to identify the passive modalities. These strings can be either a full or a partial match to the names of the modalities.
Technically, the set.passive() function changes the default value of the passive argument of soc.ca() from passive="Missing" to whatever is specified. You can obtain the names of the modalities in your result with result$names.mod.
set.passive(c("MISSING", "Missing", "Irrelevant", "residence_value_cat2: Udlandet"))
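To see what a full or partial match means in practice, the sketch below flags modality names that contain any of the passive strings. It assumes the matching behaves like a case-sensitive substring search; the package's actual matching rule may differ, and apart from the first entry the modality names in mods are made up:

```r
# Modality names in the "variable: category" form used by the package.
# Only the first name appears in the analysis; the rest are hypothetical.
mods <- c("mba: Almindelig MBA", "mba: MISSING", "phd: Irrelevant", "abroad: Missing")

# Strings that mark a modality as passive.
patterns <- c("MISSING", "Missing", "Irrelevant")

# One column per pattern, one row per modality; TRUE where the pattern occurs.
hits <- sapply(patterns, function(p) grepl(p, mods, fixed = TRUE))

# A modality is passive if at least one pattern matches it.
is_passive <- rowSums(hits) > 0
passive_mods <- mods[is_passive]
```

Because the search is case-sensitive in this sketch, both "MISSING" and "Missing" have to be listed, just as in the set.passive() call above.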
The actual analysis is made by the soc.ca() function, which has the following defaults: soc.ca(x, sup=NULL, identifier=NULL, passive="Missing"). The passive argument lets you choose a different set of passive modalities than those specified by set.passive().
result <- soc.ca(active, sup, id)
Now we want to know more about our exciting analysis, so let's "print" the results.
result
Specific Correspondence Analysis:
Statistics Scree plot
Active dimensions: 21 | 1. 24.7% ************
Dimensions explaining 80% of inertia: 7 | 2. 18.5% *********
Active modalities: 60 | 3. 14.0% *******
Supplementary modalities: 20 | 4. 8.8% ****
Individuals: 100 | 5. 6.7% ****
Mass in subset 0.96 | 6. 5.6% ***
The active variables:
careerprofile_maclean_cat (2) careerfoundation_maclean_cat (5) years_between_edu_dir_cat (4)
time_in_corp_before_ceo_cat (5) age_as_ceo_cat (4) career_changes_cat2 (3)
mba (3) abroad (3) hd (2)
phd (2) education (9) author (2)
placeofbirth (5) familyclass_bourdieu (7) partnersfamily_in_whoswho (2)
family_in_whoswho (2)
These dimensions contributions are skewed towards (-):
Dim. + - +/-
7. 0.32 0.68 0.48
13. 0.30 0.70 0.42
This output gives us several kinds of information on the quality of our analysis. Let's have a look at some of the less obvious ones.
The scree plot shows how much of the total inertia is explained by each of the first 6 dimensions.
Active dimensions: This is the number of dimensions that have a contribution above average.
Mass in subset: This is the amount of mass in the active subset of modalities. Remember that the mass of the passive modalities still contributes to the analysis. This measure tells us if we have put too much mass as passive. In this analysis we have 96% in the subset and therefore only 4% "passive mass". So we are in the clear.
The active variables: This gives us the names of the variables that were used in the analysis and how many active modalities each variable has.
Then we get a balance evaluation of the active dimensions. We see that the 7th dimension has 68% of its contribution on the negative side, so the dimension is skewed towards minus. We now know that this dimension is not good at differentiating on the + side. But we most likely won't look at the 7th dimension anyway, so no worries.
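Some of the quantities in the printout can be recomputed by hand, which may help in reading them. The sketch below rests on several assumptions: the mass of a modality is taken to be its frequency divided by (number of individuals x number of variables), as is usual in multiple correspondence analysis, and the +/- column is taken to be the ratio of the positive to the negative share, which agrees with the printed values up to rounding. The toy data and the ctr and coord vectors are made up:

```r
# Explained inertia (%) of the first six dimensions, copied from the scree plot.
explained  <- c(24.7, 18.5, 14.0, 8.8, 6.7, 5.6)
cumulative <- cumsum(explained)
# The six dimensions shown stay below 80%, in line with the 7 dimensions reported.
below_80 <- all(cumulative < 80)

# Mass in subset, on a toy data set with one "Missing" modality.
toy <- data.frame(a = factor(c("x", "x", "y", "y", "Missing")),
                  b = factor(c("p", "p", "q", "q", "q")))
counts <- unlist(lapply(names(toy), function(v) {
  tab <- table(toy[[v]])
  setNames(as.vector(tab), paste(v, names(tab), sep = ": "))
}))
mass           <- counts / (nrow(toy) * ncol(toy))   # masses sum to 1
passive        <- grepl("Missing", names(mass), fixed = TRUE)
mass_in_subset <- sum(mass[!passive])                # 9 of 10 answers are active

# Balance of a dimension: share of the contribution on each side of zero.
ctr   <- c(40, 28, 20, 12)          # hypothetical contributions
coord <- c(-1.2, -0.6, 0.9, 0.4)    # hypothetical coordinates
share_pos <- sum(ctr[coord > 0]) / sum(ctr)
share_neg <- sum(ctr[coord < 0]) / sum(ctr)
balance   <- share_pos / share_neg  # well below 1: skewed towards (-)
```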
Let's have a closer look at the first dimension.
contribution(result, 1)
Here we look at the first dimension or "dim=1".
The modalities contributing above average to dimension: 1.
Ctr. Cor. Coord
education: Cand. Oecon AAU+SDU 85 284 1.94
time_in_corp_before_ceo_cat: 26+ år i firma før topchef 80 280 -1.51
years_between_edu_dir_cat: Under 6 år om at blive direktør 68 250 1.19
time_in_corp_before_ceo_cat: 3-6 år i firma før topchef 62 223 1.17
career_changes_cat2: Ingen karriereskift 62 256 -0.90
education: Elevuddannelse 55 204 -1.03
careerprofile_maclean_cat: Karrierestart i de største virksomheder 41 297 -0.47
careerprofile_maclean_cat: Karrierestart i mindre virksomheder 39 174 0.62
mba: Almindelig MBA 36 121 1.18
years_between_edu_dir_cat: 20+ år om at blive direktør 31 108 -0.99
placeofbirth: Storkøbenhavn 31 133 -0.60
careerfoundation_maclean_cat: Ingeniør, teknik og videnskab 28 99 -0.85
time_in_corp_before_ceo_cat: 16-25 år i i firma før topchef 26 100 -0.65
hd: HD 25 110 -0.53
age_as_ceo_cat: Direktør: under 40 år 23 94 0.53
age_as_ceo_cat: Direktør: 45-49 år 22 91 -0.52
career_changes_cat2: 1-3 karriereskift 21 149 0.34
years_between_edu_dir_cat: 13-19 år om at blive direktør 20 92 -0.44
education: Naturvidenskab 19 65 -0.86
age_as_ceo_cat: Direktør 50+ år 18 74 -0.47
age_as_ceo_cat: Direktør: 40-44 år 17 71 0.46
This output gives the modalities that contribute above average to the first dimension. If we wanted the contribution of all modalities, they could be found in result$ctr.mod, or we could export the results, see later.
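The "above average" rule can be sketched in a few lines. This assumes that the Ctr. column is expressed in parts per thousand, so that the average contribution of the 60 active modalities is 1000/60; that reading is consistent with the smallest contribution listed (17), but it is our assumption. The last entry in ctr is hypothetical:

```r
n_active  <- 60                 # active modalities, from the printed result
threshold <- 1000 / n_active    # average contribution, assuming parts per thousand

# Two contributions copied from the table above plus one made-up low value.
ctr <- c("hd: HD" = 25, "mba: Almindelig MBA" = 36, "hypothetical: low" = 9)

# Keep the modalities above the average and sort them like the printed table.
above <- sort(ctr[ctr > threshold], decreasing = TRUE)
```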
A different approach is tab.dim(), which gives us the same information as contribution() but orders the results according to +/-. This is useful for publications.
tab.dim(result, 1)
Again the 1 is for the first dimension.
Dimension 1. (+)
Ctr Coord
education: Cand. Oecon AAU+SDU 85 1.94
years_between_edu_dir_cat: Under 6 år om at blive direktør 68 1.19
time_in_corp_before_ceo_cat: 3-6 år i firma før topchef 62 1.17
careerprofile_maclean_cat: Karrierestart i mindre virksomheder 39 0.62
mba: Almindelig MBA 36 1.18
age_as_ceo_cat: Direktør: under 40 år 23 0.53
career_changes_cat2: 1-3 karriereskift 21 0.34
age_as_ceo_cat: Direktør: 40-44 år 17 0.46
Dimension 1. (-)
Ctr Coord
time_in_corp_before_ceo_cat: 26+ år i firma før topchef 80 -1.51
career_changes_cat2: Ingen karriereskift 62 -0.90
education: Elevuddannelse 55 -1.03
careerprofile_maclean_cat: Karrierestart i de største virksomheder 41 -0.47
years_between_edu_dir_cat: 20+ år om at blive direktør 31 -0.99
placeofbirth: Storkøbenhavn 31 -0.60
careerfoundation_maclean_cat: Ingeniør, teknik og videnskab 28 -0.85
time_in_corp_before_ceo_cat: 16-25 år i i firma før topchef 26 -0.65
hd: HD 25 -0.53
age_as_ceo_cat: Direktør: 45-49 år 22 -0.52
years_between_edu_dir_cat: 13-19 år om at blive direktør 20 -0.44
education: Naturvidenskab 19 -0.86
age_as_ceo_cat: Direktør 50+ år 18 -0.47
From the output of tab.dim() we see that the first dimension is organized according to volume of "organizational capital". The more negative the coordinate, the more "organizational capital" the CEO has. To make the maps easier to read we are going to invert the first dimension, so that a coordinate of -1.5 becomes +1.5. Because we are replicating the analysis in XXX, we are also inverting the second and third dimensions.
result <- invert(result, c(1,2,3))
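Inverting a dimension only flips the sign of its coordinates; distances between points are unchanged. The sketch below shows the idea on a plain matrix and is not the internals of invert(). The first column uses two coordinates from the table above, and the other columns are made up:

```r
# Rows are modalities, columns are dimensions.
coord <- matrix(c(-1.51,  1.94,    # dimension 1 (from the table above)
                   0.30, -0.20,    # dimension 2 (hypothetical)
                   0.10,  0.50),   # dimension 3 (hypothetical)
                nrow = 2,
                dimnames = list(c("26+ years in firm", "Cand. Oecon"),
                                c("dim1", "dim2", "dim3")))

# Flip the sign of the chosen dimensions and leave the rest untouched.
flip <- c(1, 2)
inverted <- coord
inverted[, flip] <- -coord[, flip]

# Distances between the points are the same before and after.
same_distance <- isTRUE(all.equal(dist(coord), dist(inverted),
                                  check.attributes = FALSE))
```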
Looking at the tables produced earlier, it is clear that the labels are too long, too technical and in Danish, so we are going to improve them. First we export the labels to a .csv file containing two columns: "New Label" and "Old Label".
export.label(result)
This creates the file label_result.csv. If you want it saved in another location or under a different name, use the file= option.
Open the .csv file in a spreadsheet editor like LibreOffice and edit each label under "New Label" to its new version. Remember to leave "Old Label" unchanged, and remember to save in the UTF-8 format. Like a TV chef we have prepared a little something in advance: an English translation in the "english.csv" file.
result <- assign.label(result, file="https://raw.github.com/Rsoc/soc.ca/master/wiki_labels.csv")
The assign.label() function does not require that all or any of the labels in the .csv file are present in the result object. This is very useful because the same label file can be used for different analyses.
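The tolerant behaviour described here is what you get from a simple lookup. The sketch below is our own illustration with match(), not the code of assign.label(); the column names Old.label and New.label, the translations and the "sector" entry are hypothetical:

```r
# A label file read into a data.frame: old names and their replacements.
label_file <- data.frame(
  Old.label = c("mba: Almindelig MBA", "sector: Shipping"),
  New.label = c("MBA: Ordinary", "Sector: Shipping"),
  stringsAsFactors = FALSE
)

# Modality names currently in the result; "hd: HD" is not in the label file,
# and "sector: Shipping" from the file is not in the result.
old_names <- c("mba: Almindelig MBA", "hd: HD")

# Look every name up; names without an entry keep their old label.
idx       <- match(old_names, label_file$Old.label)
new_names <- ifelse(is.na(idx), old_names, label_file$New.label[idx])
```

Entries in the file that match nothing are simply ignored, which is why one label file can serve several analyses.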
Sometimes it may be unclear how rare a modality is, and it may therefore be useful to add the number of respondents in each modality to its label. For this we have the add.n() function.
result <- add.n(result)
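The effect on a label can be reproduced with table() and paste0(). This is a sketch of the idea on a made-up factor, using the "(n:...)" format seen in the output below; it is not the implementation of add.n():

```r
# Made-up answers for a single variable.
answers <- factor(c("Author", "Not Author", "Not Author", "Not Author"))

# Count the respondents in each modality (in the order of the factor levels).
n <- table(answers)

# Append the count to each label.
levels(answers) <- paste0(names(n), " (n:", as.vector(n), ")")
```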
Let's have a look at the second dimension with the new labels.
contribution(result, 2)
The modalities contributing above average to dimension: 2.
Ctr. Cor. Coord
Author (n:16) 102 341 1.34
Career changes: 4+ (n:19) 70 242 -1.02
Phd (n:8) 69 210 1.55
Career foundation: State, University, Law and Organisations (n:11) 54 172 1.18
Education: Bsc. Business School (n:11) 51 161 -1.14
Born: Rural areas (n:21) 44 157 -0.77
Born: Aarhus, Odense and Aalborg (n:13) 43 140 0.97
Career start: Enterprise (n:31) 42 173 -0.62
Father: Craftsman or shopkeeper (n:7) 40 121 -1.27
Years in corp. until CEO: Hired as CEO (n:31) 32 131 -0.54
Education: MSc in Economics (n:7) 31 95 1.12
Born: Provincial cities (n:23) 31 114 0.62
Education: Engineer (n:17) 28 95 0.68
Career foundation: Engineering, Science and Technical (n:12) 27 86 0.79
Career foundation: Marketing og Media (n:9) 27 85 -0.93
Years in corp. until CEO: 16-25 (n:19) 22 77 0.57
Education to executive: 13-19 years (n:32) 19 79 0.41
Education: Ma. Business School (n:19) 19 66 -0.53
Not Author (n:84) 19 341 -0.26
For an introduction to the plotting functions see https://github.com/Rsoc/soc.ca/wiki/Plotting
Common errors
The most common error here is that the encoding of the data file isn't UTF-8 but latin1.
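The difference between the two encodings is visible in the bytes, and iconv() converts between them. The sketch below uses the Danish word "år" as an example. For a whole file, passing fileEncoding = "latin1" to read.csv() is one way to read latin1 data correctly; whether that is the right fix depends on how your file was saved:

```r
# The Danish word "år", written with an escape so the script itself is ASCII.
x_utf8 <- "\u00e5r"

# The same word as it would be stored in a latin1 file.
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

bytes_utf8   <- charToRaw(x_utf8)     # c3 a5 72: "å" takes two bytes in UTF-8
bytes_latin1 <- charToRaw(x_latin1)   # e5 72:    "å" takes one byte in latin1

# Converting latin1 text back to UTF-8 restores the original bytes.
repaired <- iconv(x_latin1, from = "latin1", to = "UTF-8")

# For a file, something like this may be enough (the path is hypothetical):
# data <- read.csv("my_data.csv", sep = ";", fileEncoding = "latin1")
```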
When you set something as passive, make sure that the modality name you use has not been altered by other functions like add.n().