How to use soc.ca

This page is outdated.

Intro

Here we give an example of how to use the soc.ca package to do a specific correspondence analysis. We assume that you already have the soc.ca package loaded from the source files and that the data is ready for analysis. Text written in italics describes possible errors, bugs or missing features that you might need to be aware of.

The data used in this analysis covers the top 100 CEOs in Denmark. Much of the data is in Danish.

Software

RStudio gets our warmest recommendations. It gives you a very friendly environment for learning R.

Even though we pride ourselves on our plots, they almost always require some editing. Inkscape is perfect for this and many other things.

Getting the example

The code for this example is in wiki_example.R. Not all the code in that file is used in the example below.

Getting the data ready

First we are going to create an object with our data from a .csv file. Any other method will do. The data should be a matrix or data.frame with rows as individuals and variables as columns.

data <- read.csv(file="https://raw.github.com/Rsoc/soc.ca/master/wiki_data.csv", sep=";", encoding="UTF-8")

attach(data)

From this data object we are going to create three different objects: the active variables active, the supplementary variables sup and the identifier id. There are different requirements for each set of variables. The active and the supplementary variables have to be factors with no empty levels, and each set is organized in a data.frame. Use str() to check that the columns of active and sup are factors. The identifier is preferably a vector with a unique value for each row in the data. anyDuplicated() returns the index of the first duplicated id (0 if there are none), and duplicated() tells you which ids are duplicates. The identifier variable is used to create the cloud of individuals and can be left out.

active      <- data.frame(careerprofile_maclean_cat, careerfoundation_maclean_cat,
                        years_between_edu_dir_cat, time_in_corp_before_ceo_cat,
                        age_as_ceo_cat, career_changes_cat2, mba, abroad, hd, phd,
                        education, author, placeofbirth, familyclass_bourdieu,
                        partnersfamily_in_whoswho, family_in_whoswho)

sup        <- data.frame(size_prestige, ownership_cat_2, sector, location)

id         <- navn

If your data has no valid identifier, then it can be constructed in the following way:

id <- 1:nrow(data)
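
The requirements described above can be checked with base R before running the analysis. The following is a minimal sketch on a toy data frame; the object toy and its columns are made up for illustration and are not part of the example data.

```r
# Toy stand-in for the real data (hypothetical names and values).
toy <- data.frame(
  mba    = factor(c("Yes", "No", "No", "Yes"), levels = c("Yes", "No", "Unused")),
  sector = factor(c("A", "B", "A", "B")),
  navn   = c("Anna", "Bo", "Bo", "Dan"),
  stringsAsFactors = FALSE
)

toy_active <- toy[, c("mba", "sector")]

# TRUE only if every column is a factor.
all_factors <- all(sapply(toy_active, is.factor))

# Remove levels that no row uses ("Unused" in mba).
toy_active <- droplevels(toy_active)

# Index of the first duplicated id, or 0 if all ids are unique.
first_dup <- anyDuplicated(toy$navn)

# Logical vector marking every id that repeats an earlier one.
which_dup <- duplicated(toy$navn)

# One way to force unique ids: append a suffix to the repeats.
toy_id <- make.unique(toy$navn)
```

Here first_dup points at the second "Bo", and make.unique() renames it so the identifier can be used as is.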

Passive modalities

The specific correspondence analysis sets certain modalities as passive. You define which modalities are set as passive with the set.passive() function. This function changes the default strings of text that are used to identify the passive modalities. These strings can be either a full or a partial match to the names of the modalities. Technically, set.passive() changes the default value of the passive argument of soc.ca() from passive="Missing" to whatever is specified. You can obtain the names of the modalities in your result from result$names.mod.

set.passive(c("MISSING", "Missing", "Irrelevant", "residence_value_cat2: Udlandet"))
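
To see what a full or partial match means in practice, the sketch below flags modality names that contain any of the passive strings. It uses plain, case-sensitive substring matching with grepl(), which is an assumption for illustration and not necessarily how soc.ca matches internally. The modality names are made up.

```r
# Hypothetical modality names, in the "variable: category" style of the output.
modality_names <- c("mba: Almindelig MBA", "mba: MISSING", "phd: Irrelevant", "hd: HD")

passive_patterns <- c("MISSING", "Missing", "Irrelevant")

# One column per pattern, one row per modality; TRUE where the pattern occurs.
match_matrix <- sapply(passive_patterns,
                       function(p) grepl(p, modality_names, fixed = TRUE))

# A modality is treated as passive if at least one pattern matches it.
is_passive <- rowSums(match_matrix) > 0

modality_names[is_passive]
```

Because the match in this sketch is case sensitive, both "MISSING" and "Missing" are listed, as in the set.passive() call above.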

Analysis

The actual analysis is made by the soc.ca() function, which has the following defaults: soc.ca(x, sup=NULL, identifier=NULL, passive="Missing"). The passive argument lets you choose another set of passive modalities than those specified by set.passive().

result <- soc.ca(active, sup, id)

Now we want to know more about our exciting analysis, so let's "print" the results.

result

                            Specific Correspondence Analysis:                             

                    Statistics                                   Scree plot               
	Active dimensions:                             21  |  1.     24.7%   ************
	Dimensions explaining 80% of inertia:           7  |  2.     18.5%   *********
	Active modalities:                             60  |  3.     14.0%   *******
	Supplementary modalities:                      20  |  4.      8.8%   ****
	Individuals:                                  100  |  5.      6.7%   ****
	Mass in subset                               0.96  |  6.      5.6%   ***

                                       The active variables:                                        
   careerprofile_maclean_cat (2) careerfoundation_maclean_cat (5)    years_between_edu_dir_cat (4) 
 time_in_corp_before_ceo_cat (5)               age_as_ceo_cat (4)          career_changes_cat2 (3) 
                         mba (3)                       abroad (3)                           hd (2) 
                         phd (2)                    education (9)                       author (2) 
                placeofbirth (5)         familyclass_bourdieu (7)    partnersfamily_in_whoswho (2) 
           family_in_whoswho (2)

                       These dimensions contributions are skewed towards (-):                         
                                    Dim.    +    -  +/-
                                      7. 0.32 0.68 0.48
                                     13. 0.30 0.70 0.42

This output gives us several kinds of information on the quality of our analysis. Let's have a look at some of the less obvious ones.

The scree plot shows how much of the total inertia is explained by each of the first 6 dimensions.

Active dimensions: This is the number of dimensions which have a contribution above average.

Mass in subset: This is the amount of mass in the active subset of modalities. Remember that the mass of the passive modalities still contributes to the analysis. This measure tells us if we have put too much mass as passive. In this analysis we have 96% in the subset and therefore only 4% "passive mass". So we are in the clear.

The active variables: This gives us the names of the variables that were used in the analysis and how many active modalities each variable has.

Then we get a balance evaluation of the active dimensions. We see that the 7th dimension has 68% of its contribution on the negative side, so the dimension is skewed towards minus. We now know that this dimension is not good at differentiating on the + side. But we most likely won't look at the 7th dimension anyway, so no worries.
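
Two of the summary figures can be checked by hand from the printed output. The sketch below uses only the numbers shown above: the six scree plot percentages and the mass in subset. The variable names are ours.

```r
# Percentages for dimensions 1 to 6, copied from the scree plot above.
scree_pct <- c(24.7, 18.5, 14.0, 8.8, 6.7, 5.6)

# Cumulative share of inertia after each of the six dimensions.
cum_pct <- round(cumsum(scree_pct), 1)

# Share of the mass that was set as passive.
mass_in_subset <- 0.96
passive_mass <- 1 - mass_in_subset
```

The six dimensions shown add up to 78.3%, just short of 80%, which is consistent with the reported 7 dimensions needed to explain 80% of the inertia.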

Let's have a closer look at the first dimension.

contribution(result, 1)

Here we look at the first dimension or "dim=1".

                        The modalities contributing above average to dimension: 1.                         
 
                                                                      Ctr.    Cor.    Coord
education: Cand. Oecon AAU+SDU                                          85     284     1.94
time_in_corp_before_ceo_cat: 26+ år i firma før topchef                 80     280    -1.51
years_between_edu_dir_cat: Under 6 år om at blive direktør              68     250     1.19
time_in_corp_before_ceo_cat: 3-6 år i firma før topchef                 62     223     1.17
career_changes_cat2: Ingen karriereskift                                62     256    -0.90
education: Elevuddannelse                                               55     204    -1.03
careerprofile_maclean_cat: Karrierestart i de største virksomheder      41     297    -0.47
careerprofile_maclean_cat: Karrierestart i mindre virksomheder          39     174     0.62
mba: Almindelig MBA                                                     36     121     1.18
years_between_edu_dir_cat: 20+ år om at blive direktør                  31     108    -0.99
placeofbirth: Storkøbenhavn                                             31     133    -0.60
careerfoundation_maclean_cat: Ingeniør, teknik og videnskab             28      99    -0.85
time_in_corp_before_ceo_cat: 16-25 år i i firma før topchef             26     100    -0.65
hd: HD                                                                  25     110    -0.53
age_as_ceo_cat: Direktør: under 40 år                                   23      94     0.53
age_as_ceo_cat: Direktør: 45-49 år                                      22      91    -0.52
career_changes_cat2: 1-3 karriereskift                                  21     149     0.34
years_between_edu_dir_cat: 13-19 år om at blive direktør                20      92    -0.44
education: Naturvidenskab                                               19      65    -0.86
age_as_ceo_cat: Direktør 50+ år                                         18      74    -0.47
age_as_ceo_cat: Direktør: 40-44 år                                      17      71     0.46

This output gives the modalities that contribute above average to the first dimension. If we wanted the contribution of all modalities, they could be found in result$ctr.mod, or we could export the results (see later). A different approach is tab.dim(), which gives us the same as contribution() but orders the results according to +/-. This is useful for publications.

tab.dim(result, 1)

Again, the 1 is for the first dimension.

                          Dimension 1. (+)                          
                                                                        Ctr    Coord
education: Cand. Oecon AAU+SDU                                           85     1.94
years_between_edu_dir_cat: Under 6 år om at blive direktør               68     1.19
time_in_corp_before_ceo_cat: 3-6 år i firma før topchef                  62     1.17
careerprofile_maclean_cat: Karrierestart i mindre virksomheder           39     0.62
mba: Almindelig MBA                                                      36     1.18
age_as_ceo_cat: Direktør: under 40 år                                    23     0.53
career_changes_cat2: 1-3 karriereskift                                   21     0.34
age_as_ceo_cat: Direktør: 40-44 år                                       17     0.46

                          Dimension 1. (-)                          
                                                                        Ctr    Coord
time_in_corp_before_ceo_cat: 26+ år i firma før topchef                  80    -1.51
career_changes_cat2: Ingen karriereskift                                 62    -0.90
education: Elevuddannelse                                                55    -1.03
careerprofile_maclean_cat: Karrierestart i de største virksomheder       41    -0.47
years_between_edu_dir_cat: 20+ år om at blive direktør                   31    -0.99
placeofbirth: Storkøbenhavn                                              31    -0.60
careerfoundation_maclean_cat: Ingeniør, teknik og videnskab              28    -0.85
time_in_corp_before_ceo_cat: 16-25 år i i firma før topchef              26    -0.65
hd: HD                                                                   25    -0.53
age_as_ceo_cat: Direktør: 45-49 år                                       22    -0.52
years_between_edu_dir_cat: 13-19 år om at blive direktør                 20    -0.44
education: Naturvidenskab                                                19    -0.86
age_as_ceo_cat: Direktør 50+ år                                          18    -0.47
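
The "above average" cut-off in these tables can be reproduced approximately by hand. The sketch below assumes that Ctr. is reported in per mille and that the average is taken over the 60 active modalities; this is our reading of the output, not something taken from the package source.

```r
# Number of active modalities, from the printed summary.
n_active_modalities <- 60

# Average contribution in per mille if all modalities contributed equally.
threshold <- 1000 / n_active_modalities

# Ctr. column for dimension 1, copied from the contribution() output.
ctr_dim1 <- c(85, 80, 68, 62, 62, 55, 41, 39, 36, 31, 31,
              28, 26, 25, 23, 22, 21, 20, 19, 18, 17)

# Every listed modality should lie above the cut-off.
all_above <- all(ctr_dim1 > threshold)

# Total contribution carried by the listed modalities.
total_listed <- sum(ctr_dim1)
```

Under these assumptions the cut-off is about 16.7, and the 21 listed modalities together carry 809 of the 1000 per mille on the first dimension.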

Changing the output

From the output of tab.dim() we see that the first dimension is organized according to the volume of "organizational capital". The more negative the coordinate, the more "organizational capital" the CEO has. To make the maps easier to read we are going to invert the first dimension, so that a coordinate of -1.5 becomes +1.5. Because we are replicating the analysis in XXX, we are also inverting the second and third dimensions.

result <- invert(result, c(1,2,3))
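
Conceptually, inverting a dimension flips the sign of every coordinate on that axis. The sketch below shows the idea on a small made-up coordinate matrix; it illustrates the arithmetic only and is not the implementation of invert().

```r
# Hypothetical coordinates: 2 modalities (rows) on 4 dimensions (columns).
coords <- matrix(c(-1.5,  0.9, 0.4, -0.2,
                    1.1, -0.7, 0.3,  0.8),
                 nrow = 2, byrow = TRUE)

dims_to_invert <- c(1, 2, 3)

# Flip the sign on the chosen dimensions and leave the rest untouched.
flipped <- coords
flipped[, dims_to_invert] <- -coords[, dims_to_invert]
```

Distances between points are unchanged by this operation, so only the orientation of the map is affected.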

But looking at the tables produced earlier, it is clear that the labels are too long, too technical and in Danish. So we are going to improve them. First we export the labels to a .csv file containing two columns: "New Label" and "Old Label".

export.label(result)

This creates the file label_result.csv. If you want it saved in another location or under a different name, use the file= option.

Open the .csv file in a spreadsheet editor such as LibreOffice and edit each entry under "New Label". Remember to leave "Old Label" unchanged and to save the file in UTF-8. Like a TV chef, we have prepared a little something in advance: an English translation in the "english.csv" file.

result <- assign.label(result, file="https://raw.github.com/Rsoc/soc.ca/master/wiki_labels.csv")

The assign.label() function does not require that all, or even any, of the labels in the .csv file are present in the result object. This is very useful because the same label file can be used for different analyses.
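
The behaviour described above, where labels without a counterpart are simply skipped, can be pictured as a lookup on the old label. The sketch below uses base R's match() on made-up labels; the column names new and old and all label strings are hypothetical, and this is not the code used by assign.label().

```r
# Hypothetical label file: one row per translation.
label_table <- data.frame(
  new = c("Author", "Phd", "Not in this analysis"),
  old = c("author: Forfatter", "phd: Phd", "other: Andet"),
  stringsAsFactors = FALSE
)

# Hypothetical labels currently used in a result.
current <- c("phd: Phd", "hd: HD", "author: Forfatter")

# Position of each current label in the "old" column, NA if absent.
idx <- match(current, label_table$old)

# Use the new label where a match exists, otherwise keep the current one.
relabelled <- ifelse(is.na(idx), current, label_table$new[idx])
```

In this sketch "hd: HD" has no entry in the table and is kept, while the unused row "other: Andet" causes no error.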

Sometimes it may be unclear how rare a modality is, and it may therefore be useful to add the number of respondents in each modality to its label. For this we have the add.n() function.

result <- add.n(result)
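
The resulting labels look like "Author (n:16)". A minimal sketch of how such labels can be built in base R is shown below, using a made-up factor; it mimics the label format seen in the output and is not the implementation of add.n().

```r
# Hypothetical variable with two modalities.
author <- factor(c("Author", "Not Author", "Not Author", "Author", "Not Author"))

# Number of individuals in each modality.
counts <- table(author)

# Append the count to each modality name in the "(n:...)" format.
labels_with_n <- paste0(names(counts), " (n:", as.vector(counts), ")")
```

A label with a small n is a reminder that the position of that modality rests on few individuals.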

Let's have a look at the second dimension with the new labels.

contribution(result, 2)

                The modalities contributing above average to dimension: 2.                
 
                                                                      Ctr.    Cor.    Coord
Author (n:16)                                                          102     341     1.34
Career changes: 4+  (n:19)                                              70     242    -1.02
Phd (n:8)                                                               69     210     1.55
Career foundation: State, University, Law and Organisations (n:11)      54     172     1.18
Education: Bsc. Business School (n:11)                                  51     161    -1.14
Born: Rural areas (n:21)                                                44     157    -0.77
Born: Aarhus, Odense and Aalborg (n:13)                                 43     140     0.97
Career start: Enterprise (n:31)                                         42     173    -0.62
Father: Craftsman or shopkeeper (n:7)                                   40     121    -1.27
Years in corp. until CEO: Hired as CEO (n:31)                           32     131    -0.54
Education: MSc in Economics (n:7)                                       31      95     1.12
Born: Provincial cities (n:23)                                          31     114     0.62
Education: Engineer (n:17)                                              28      95     0.68
Career foundation: Engineering, Science and Technical (n:12)            27      86     0.79
Career foundation: Marketing og Media (n:9)                             27      85    -0.93
Years in corp. until CEO: 16-25 (n:19)                                  22      77     0.57
Education to executive: 13-19 years (n:32)                              19      79     0.41
Education: Ma. Business School (n:19)                                   19      66    -0.53
Not Author (n:84)                                                       19     341    -0.26

Exporting

Plotting

For an introduction to the plotting functions see https://github.com/Rsoc/soc.ca/wiki/Plotting

Common errors

The most common error here is that the data file is encoded in latin1 rather than UTF-8.
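
A quick way to spot and repair this kind of problem in base R is sketched below. It builds the latin1 bytes of a Danish name by hand, checks that they are not valid UTF-8, and converts them with iconv(). The name is made up for illustration. When reading a file, the fileEncoding argument of read.csv() can do the same conversion at import.

```r
# "Søren" as it would be stored in a latin1 file: ø is the single byte 0xf8.
latin1_string <- rawToChar(as.raw(c(0x53, 0xf8, 0x72, 0x65, 0x6e)))

# FALSE signals that the text is not valid UTF-8.
looks_like_utf8 <- validUTF8(latin1_string)

# Re-encode from latin1 to UTF-8; ø becomes the two bytes 0xc3 0xb8.
utf8_string <- iconv(latin1_string, from = "latin1", to = "UTF-8")

utf8_bytes <- charToRaw(utf8_string)
```

If validUTF8() returns FALSE for labels in your data, garbled modality names in the output are likely to come from the file encoding rather than from soc.ca.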

When you set something as passive, make sure that the modality name you use has not been altered by other functions such as add.n().
