p04-DataFrames.Rmd

# Data Frame {#df-p04}

```{r 'setup', include=FALSE}
source('_common.R')
knitr::read_chunk('code/Functions.R')
```

## Basics

> Indexing starts from 1 in R and from 0 in Python.  

R provides `data.frame()` for tabular data structure. `{tibble}` & `{data.table}` are packages which extends its capabilities. Python module `r q_link('{pandas}')` provide similar capabilities

R `data.frame()` is a `list()` of variables of the same number of rows. It is a `matrix()` like structure whose columns may be of differing types. Similarly, Python Pandas `r q_link('pd.DataFrame()')` is a 2-dimensional data structure that can store data of different types in columns. 


- Create R `data.frame()`, `tibble::tibble()` and Python `r q_link('pd.DataFrame()')`

```{r 'R-DF-Create', decorate=TRUE}
nn <- 4L                                          #Number of Rows

# R Data Frame: integer, double, character, logical, factor
df_r <- data.frame(
  INT = 1:nn, NUM = seq(1, nn, 1), CHR = letters[1:nn], LGL = (1:nn %% 2) == 0, 
  FCT = factor(rep(c('No', 'Yes'), length.out = nn)))

# Tibble
tbl <- tibble::tibble(
  INT = 1:nn, NUM = seq(1, nn, 1), CHR = letters[1:nn], LGL = (1:nn %% 2) == 0, 
  FCT = factor(rep(c('No', 'Yes'), length.out = nn)))

stopifnot(all(identical(df_r, as.data.frame(tbl)), 
              identical(tbl, tibble::as_tibble(df_r))))
```

```{python 'Y-DF-Create', decorate=TRUE}
pp = 4                                            #Number of Rows

qq = {'INT': [i+1 for i in range(pp)], 
      'NUM': (float(i+1) for i in range(pp)),
      'CHR': [chr(i) for i in range(ord('a'), ord('a') +pp)],
      'LGL': [i % 2 == 1 for i in range(pp)],
      'FCT': pd.Categorical(['No', 'Yes'] * 2)}

df_y = pd.DataFrame(data = qq)                    #DataFrame from dict

```

```{r 'R-Compare', decorate=TRUE, include=FALSE}
# Verify that R Object and Python Variable are similar
aa <- py$df_y |> `attr<-`('pandas.index', NULL)
aa$INT <- as.integer(aa$INT)
stopifnot(identical(df_r, aa))
```

- Print DataFrame: 
  - R: `print()`, `head()`, `tail()`, `dplyr::slice()`
  - Python: `r q_link('lib.functions.print()')`, `r q_link('pd.DataFrame.head()')`, `r q_link('pd.DataFrame.tail()')`, `r q_link('pd.DataFrame.iloc')`

```{r 'R-DF-Head', decorate=TRUE}
aa <- df_r
if(FALSE) print(aa)                     #Data Frame prints ALL Rows (Avoid)
if(FALSE) print(tbl, n = 2)             #Tibble can take number of rows to print

stopifnot(identical(dplyr::slice(aa, 1:2), head(aa, 2)))
head(aa, 2)                                       #Subset by Head

tail(aa, 2)                                       #Subset by Tail
```

```{python 'Y-DF-Head', decorate=TRUE}
pp = df_y.copy()

assert(pp.head(2).equals(pp.iloc[:2]))
pp.head(2)


pp.tail(2)

```

- About DataFrame:
  - R: `class()`, `typeof()`, `dim()`, `names()`, `str()`, `summary()`
  - Python: `r q_link('lib.functions.type()')`, `r q_link('pd.DataFrame.shape')`, `r q_link('pd.DataFrame.columns')`, `r q_link('pd.DataFrame.describe()')`, `r q_link('pd.DataFrame.dtypes')`, `r q_link('pd.DataFrame.info()')`

```{r 'R-DF-About', decorate=TRUE}
aa <- df_r
class(aa)                                         #Class

typeof(aa)                                        #Type

dim(aa)                                           #Dimensions [Row, Column]

names(aa)                                         #Column Headers
```

```{python 'Y-DF-About', decorate=TRUE}
pp = df_y.copy()

print(type(pp))                                   #Explicitly Print type()


pp.shape                                          #Dimensions [Row, Column]


list(pp.columns)                                  #Column Headers
```

```{r 'R-DF-Structure', decorate=TRUE}
aa <- df_r
str(aa)                                           #Structure

summary(aa)                                       #Summary
```

```{python 'Y-DF-Structure', decorate=TRUE}
pp = df_y.copy()
list(pp.describe().index)                         #(Default) Summary of Num only
[x for x in pp.describe(include = 'all').index if x not in pp.describe().index]

pp.describe(include = 'all').loc[['count', 'max', 'unique']]


pp.dtypes                                         #data type of each column


pp.info(memory_usage = False)                     #Structure
```

```{python 'Y-DF-PrintMax', decorate=TRUE}
#Prevent the collapse of middle rows or columns into (...)
if(False):
    with pd.option_context('display.max_rows', None, 
                           'display.max_columns', None):
        print(pp.describe(include = 'all'))

```

- Select Columns:
  - R: `[]` (`base::Extract`), `dplyr::select()`, `data.frame()`, `with()`, `subset()` (Avoid)
  - Python: `r q_link('pd.[]')`, `r q_link('pd.DataFrame.drop()')`, `r q_link('pd.DataFrame.filter()')`

```{r 'R-DF-Select', decorate=TRUE}
aa <- df_r
names(aa)

bb <- dplyr::select(aa, 2:3)                                #Select by Position
dd <- select(aa, NUM, CHR)                                  #Select by Name
ee <- select(aa, -c(INT, LGL, FCT))                         #Drop Columns

ff <- data.frame('NUM' = aa$NUM, 'CHR' = aa$CHR)
# with() can be used to create an environment using data
gg <- with(aa, data.frame(NUM, CHR))
# [] is used for subsetting but note that 1-column subset is vector by default
hh <- aa[ , c('NUM', 'CHR'), drop = FALSE]
ii <- subset(aa, select = c(NUM, CHR), drop = FALSE)        #Avoid

stopifnot(all(sapply(list(dd, ee, ff, gg, hh, ii), identical, bb)))
str(bb)
```

```{python 'Y-DF-Select', decorate=TRUE}
pp = df_y.copy()
list(pp.columns)

qq = pp[['NUM', 'CHR']].copy()                              #Use List of Names
ss = pp.drop(columns = ['INT', 'LGL', 'FCT']).copy()        #Drop Columns
tt = pp.drop(['INT', 'LGL', 'FCT'], axis = 1).copy()        #0 Rows, 1 Columns
uu = pp.filter(['NUM', 'CHR']).copy()

assert(qq.equals(ss) and qq.equals(tt) and qq.equals(uu))
qq

```

- Rename:
  - R: `names()`, `dplyr::rename()`
  - Python: `r q_link('pd.DataFrame.rename()')`, `r q_link('pd.DataFrame.columns')`

```{r 'R-DF-Rename', decorate=TRUE}
aa <- df_r
names(aa)

names(aa)[c(1, 3)] <- c('A', 'C')                 #By Position
names(aa)[names(aa) == 'NUM'] <- 'B'              #By Name

aa <- dplyr::rename(aa, D = LGL, E = 5)           #New = Old (Reverse of Python)
names(aa)
```

```{python 'Y-DF-Rename', decorate=TRUE}
pp = df_y.copy()
list(pp.columns)


pp.rename(columns = {'INT': 'A', pp.columns[1]: 'B'}, inplace = True) #Old: New
pp.rename(columns = dict(zip(pp.columns[[3]], ['D'])),inplace = True) #Old, New
pp.columns.values[[2, 4]] = ['C', 'E']

list(pp.columns)

```

- Sort:
  - R: `order()`, `dplyr::arrange()`, `dplyr::desc()`
  - Python: `r q_link('pd.DataFrame.index')`

```{r 'R-DF-Order', decorate=TRUE}
aa <- df_r

# order() ALWAYS reorders the rownames
bb <- aa[order(aa$CHR, decreasing = TRUE), ]

# arrange() reorders character rownames but reinitialises them from 1 if integer
dd <- dplyr::arrange(aa, dplyr::desc(CHR))
row.names(dd) <- 4:1
stopifnot(identical(bb, dd))
dd
```

```{python 'Y-DF-Order', decorate=TRUE}
pp = df_y.copy()
list(pp.index)


pp.sort_values('CHR', ascending = False, inplace = True)
list(pp.index)

```

## RowNames - Index

- R `row.names()` are called `r q_link('pd.Index')` in Python and can be set by `r q_link('pd.DataFrame.set_index()')`
- R `data.frame()` has `row.names()` but `{tibble}` heavily discourage their usage
- Duplicated row indices are allowed in Python but not in R

```{r 'R-DF-RowNames', decorate=TRUE}
aa <- data.frame(x = 1:2)
row.names(aa)

row.names(aa) <- letters[1:2]                               #rownames

stopifnot(tibble::has_rownames(aa))
row.names(aa)
```

```{python 'Y-DF-RowNames', decorate=TRUE}
pp = pd.DataFrame(data = {'x': [1, 2]})
list(pp.index)


pp.set_index([pd.Index(['a', 'a'])], inplace = True)        #Duplicate index
pp.set_index([pd.Index(['a', 'b'])], inplace = True)
list(pp.index) 

```

## Copy

- For details [see here](#copy-p01)
- Python: For `deepcopy` of `DataFrame` either use `r q_link('{lib.copy}')` or `r q_link('pd.DataFrame.copy()')`

```{r 'R-DF-Copy', decorate=TRUE}
george <- fred <- data.frame(x = 11:13)                     #R Copy-on-Modify

# Before modification both bind to the same address (unlike Python)
stopifnot(identical(obj_addr(fred), obj_addr(george)))

aa <- obj_addr(fred)                    #Address before modification

fred[2, 'x'] <- NA                      #Modify
fred$x

stopifnot(obj_addr(fred) != aa)     #Bind to a different address (unlike Python)

# No change in non-modified object (george) address or value (same as Python)
stopifnot(obj_addr(george) == aa)
```

```{python 'Y-DF-Copy', decorate=TRUE}
fred = pd.DataFrame({'x': [1, 2, 3]})
george = fred.copy()                              #Deepcopy by default
percy  = copy.deepcopy(fred)                      #Deepcopy

# Deepcopy bind to different address even before modification (unlike R)
assert(id(fred) != id(george) != id(percy))

pp = id(fred)                           #Address before modification

fred.at[1, 'x'] = None                  #Modify
list(fred['x'])


assert(id(fred) == pp)                  #No change in address (unlike R)

# No change in non-modified objects (george, percy) address or value (same as R)
assert(george.equals(percy) and not george.equals(fred))

```

## Modify 

- R `[]` (`base::Extract`) 
- Python:
  - `r q_link('pd.DataFrame.at')` can only access a single value at a time (faster). It tries to maintain the datatype (fails sometimes). If column number is used by mistake, it creates a New Column (Avoid)
  - `r q_link('pd.DataFrame.loc')` can select multiple rows and/or columns (slower). It does not maintain the datatype and modifies the type silently.
  - `r q_link('pd.DataFrame.iat')`, `r q_link('pd.DataFrame.iloc')` are indices variants of the above

```{r 'R-DF-Modify', decorate=TRUE}
aa <- df_r
aa[1, 2] <- NA
aa[2, 'CHR'] <- NA
aa
```

```{python 'Y-DF-Modify', decorate=TRUE}
pp = df_y.copy()
pp.at[0, 'NUM'] = None
pp.loc[1, 'CHR'] = None
pp.iat[3, 1] = 10.0
pp.iloc[2, 0] = 30
pp

```

## Merge, Join

- A `mutating join` allows you to combine variables from two tables. [(Advanced R - Hadley)](https://r4ds.had.co.nz/relational-data.html)
  - These are `inner join`, `left outer join`, `right outer join`, `full outer join`, `cross join`
- An `inner join` matches pairs of observations whenever their `keys` are equal. Thus, unmatched rows are not included in the result. 
  - This means that generally inner joins are usually not appropriate for use in analysis because it is too easy to lose observations.
- An `outer join` keeps observations that appear in at least one of the tables. 
  - A `left join` keeps all observations in x.
  - A `right join` keeps all observations in y.
  - A `full join` keeps all observations in x and y.
  - A `cross join` returns the Cartesian product of rows from both tables. 
- See Table \@ref(tab:P04T01) for Joins of R & Python
- R: `merge()`, `dplyr::inner_join()`, `dplyr::left_join()`, `dplyr::right_join()`, `dplyr::full_join()`
- Python: `r q_link('pd.DataFrame.merge()')` `r q_so('q53645882')`

```{r 'R-Join-Mutating', decorate=TRUE}
aa <- tibble(ID = c(1, 2, 3), A = c('a1', 'a2', 'a3'))
bb <- tibble(ID = c(1, 2, 4), B = c('b1', 'b2', 'b4'))

# Inner Join
ab_j_inner <- dplyr::inner_join(aa, bb, by = 'ID') 
ab_m_inner <- merge(aa, bb, by = 'ID')

# Left Outer Join
ab_j_left <- dplyr::left_join(aa, bb, by = 'ID')
ab_m_left <- merge(aa, bb, by = 'ID', all.x = TRUE)

# Right Outer Join
ab_j_right <- dplyr::right_join(aa, bb, by = 'ID')
ab_m_right <- merge(aa, bb, by = 'ID', all.y = TRUE)

# Full Outer Join
ab_j_full <- dplyr::full_join(aa, bb, by = 'ID')
ab_m_full <- merge(aa, bb, by = 'ID', all = TRUE)

# Cross Join
ab_j_cross <- full_join(aa, bb, by = character(), suffix = c('_x', '_y'))
ab_m_cross <- merge(aa, bb, by=NULL, suffixes = c('_x', '_y')) |> arrange(ID_x)
```

```{r 'R-Join-Verify', decorate=TRUE, include=FALSE}
stopifnot(all(identical(as.data.frame(ab_j_inner), ab_m_inner), 
              identical(as.data.frame(ab_j_left),  ab_m_left),
              identical(as.data.frame(ab_j_right), ab_m_right),
              identical(as.data.frame(ab_j_full),  ab_m_full), 
              identical(as.data.frame(ab_j_cross), ab_m_cross)))
```

```{python 'Y-Join-Mutating', decorate=TRUE}
pp = pd.DataFrame({'ID': (1, 2, 3), 'A': ('a1', 'a2', 'a3')}) 
qq = pd.DataFrame({'ID': (1, 2, 4), 'B': ('b1', 'b2', 'b4')}) 

pq_inner = pp.merge(qq, on = 'ID', how = 'inner')           #Inner Join
pq_left  = pp.merge(qq, on = 'ID', how = 'left')            #Left Outer Join
pq_right = pp.merge(qq, on = 'ID', how = 'right')           #Right Outer Join
pq_full  = pp.merge(qq, on = 'ID', how = 'outer')           #Full Outer Join
pq_cross = pp.merge(qq, how = 'cross')                      #Cross Join

```

```{r 'R-Join-Y-Verify', decorate=TRUE, include=FALSE}
ab_inner <- py$pq_inner |> `attr<-`('pandas.index', NULL) |> 
    mutate(across(where(is.list), q_NULL_to_NA)) |> as_tibble()
ab_left <- py$pq_left |> `attr<-`('pandas.index', NULL) |> 
    mutate(across(where(is.list), q_NULL_to_NA)) |> as_tibble()
ab_right <- py$pq_right |> `attr<-`('pandas.index', NULL) |> 
    mutate(across(where(is.list), q_NULL_to_NA)) |> as_tibble()
ab_full <- py$pq_full |> `attr<-`('pandas.index', NULL) |> 
    mutate(across(where(is.list), q_NULL_to_NA)) |> as_tibble()
ab_cross <- py$pq_cross |> `attr<-`('pandas.index', NULL) |> 
    mutate(across(where(is.list), q_NULL_to_NA)) |> as_tibble()
stopifnot(all(identical(ab_j_inner, ab_inner), 
              identical(ab_j_left,  ab_left), 
              identical(ab_j_right, ab_right), 
              identical(ab_j_full,  ab_full), 
              identical(ab_j_cross, ab_cross)))
```

```{r 'R-Join-Display', echo=FALSE}
# Print Multiple Kable Tables
kables(list(q_kbl(ab_j_inner), q_kbl(ab_j_left), 
            q_kbl(ab_j_right), q_kbl(ab_j_full), q_kbl(ab_j_cross)), 
  caption = paste0('Inner Join, Left Outer Join, Right Outer Join, ', 
                   'Full Outer Join, Cross Join'), label = 'P04T01')
```

- Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables.
  - `semi_join(x, y)` keeps all observations in x that have a match in y.
  - `anti_join(x, y)` drops all observations in x that have a match in y.


```{r 'R-Join-Filtering', decorate=TRUE}
ab_j_semi <- semi_join(aa, bb, by = 'ID')                   # Semi Join
ab_j_anti <- anti_join(aa, bb, by = 'ID')                   # Anti Join
```

```{r 'R-Join-Display-2', echo=FALSE}
kables(list(q_kbl(ab_j_semi), q_kbl(ab_j_anti)), 
       caption = 'Semi Join, Anti Join', label = 'P04T02')
```

## Sets

- These operations work with a complete row, comparing the values of every variable. 
  - Thus, these expect both tables /df to have the same variables (columns), and treat the observations (rows) like sets.
  - Inner Join vs. Intersect
    - The `INNER JOIN` will return duplicates, if `id` is duplicated in either table. `INTERSECT` removes duplicates. 
    - The `INNER JOIN` will never return `NULL`, but `INTERSECT` will return `NULL`.
  - Inner Join vs. semi-join
    - With a `semi-join`, each record in the first set is returned only once, regardless of how many matches there are in the second set.  

```{r 'R-DF-Sets', decorate=TRUE}
aa <- tibble(A = c(1, 2), B = c(1, 1))
bb <- tibble(A = c(1, 1), B = c(1, 2))

ab_isect <- intersect(aa, bb)                               # Intersection
ab_union <- union(aa, bb)                                   # Union
ab_sdiff <- setdiff(aa, bb)                                 # x - y
ba_sdiff <- setdiff(bb, aa)                                 # y - x
```

```{r 'R-Join-Display-3', echo=FALSE}
kables(list(q_kbl(ab_isect), q_kbl(ab_union), q_kbl(ab_sdiff), q_kbl(ba_sdiff)), 
       caption = 'Intersect, Union, x-y, y-x', label = 'P04T03')
```

```{r 'R-DF-Set-Compare', decorate=TRUE}
# Two 2 in First & Three 3 in Second
aa <- tibble(ID = c(1, 2, 2, 3, 4), 
             A = c('a1', 'a21', 'a22', 'a31', 'a4'))
bb <- tibble(ID = c(1, 2, 3, 3, 3, 5), 
             A = c('a1', 'a21', 'a31', 'a32', 'a33', 'a5'))
#
ii_inn <- inner_join(aa, bb, by = 'ID') 
jj_its <- intersect(aa, bb)
kk_sem <- semi_join(aa, bb, by = 'ID')
#
str(ii_inn, vec.len = nrow(ii_inn))     #Duplicate IDs of Both
str(jj_its, vec.len = nrow(jj_its))     #No Duplicate IDs of Either
str(kk_sem, vec.len = nrow(kk_sem))     #Duplicates of First found in Second
setdiff(kk_sem, jj_its)                 #Extra in Semi Join over Intersection
```


## Missing Values

- R: `NA` represents the missing values. 
  - `NULL` assignment to a `data.frame` element is problematic `r q_so('47402059')`, `r q_so('45023847')`. The column needs to be a `list` because `vector` cannot contain `NULL`. Further, `list()` has to be used instead of `c()` because `NULL` cannot be concatenated.
- Python Pandas `r q_link('pd.DataFrame()')` treats `None` in `numeric` columns as `np.nan`. It keeps `None` or `np.nan` in `object` columns as it is, however, internally these all are treated as missing values
- Note: `{reticulate}` currently converts `object` columns with `None` or `np.nan` to `list` containing `NULL` or `NaN`. It also converts `np.nan` in `numeric` column to `NaN`. All of these should be `NA`
  
```{r 'R-DF-NULL', decorate=TRUE}
# NULL cannot be concatenated
c(NULL, NULL)
list(NULL, NULL)

# Only list can contain NULL
aa <- data.frame(x = 1:3, y = I(list(NULL, 'b', 'c')))
aa$x <- list(4, NULL, 6)
str(aa)

bb <- aa |> mutate(across(where(is.list), q_NULL_to_NA))
str(bb)
```

```{r 'R-Q03', eval=FALSE, ref.label=c('Q03-null-na')}
#
```

```{python 'Y-DF-NULL', decorate=TRUE}
pp = pd.DataFrame({'x': (None, 1, 2), 'y': ('a', None, np.nan)}) 
pp.info()
pp

```

```{python 'Y-Verify', decorate=TRUE, include=FALSE}
# Count & List the Imported Modules in Python 
q_mods = [v.__name__ for k, v in globals().items() 
    if type(v) is types.ModuleType and not k.startswith('__')]
len(q_mods)
', '.join(q_mods)

```

```{r 'R-Verify', decorate=TRUE, include=FALSE}
if(FALSE) py_config()         #Python Configuration
if(FALSE) q_url[ , 'URL']     #List of URL of this Page
if(FALSE) q_()                #R Objects of this Page excluding 'q_*'
```