Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Breaking change in AUCell_buildRankings() on dgCMatrix objects #28

Open
hpages opened this issue May 20, 2022 · 7 comments
Open

Breaking change in AUCell_buildRankings() on dgCMatrix objects #28

hpages opened this issue May 20, 2022 · 7 comments
Assignees
Labels

Comments

@hpages
Copy link
Contributor

hpages commented May 20, 2022

Hi,

This used to work:

library(Matrix)
library(AUCell)
exprMatrix <- cbind(cell1=100*4:0, cell2=c(500, 0, 90, 0, 750))
rownames(exprMatrix) <- sprintf("gene%02d", seq_len(nrow(exprMatrix)))
AUCell_buildRankings(as(exprMatrix, "dgCMatrix"), plotStats=FALSE, verbose=FALSE)

but with the latest AUCell (version 1.19.1), it no longer works:

AUCell_buildRankings(as(exprMatrix, "dgCMatrix"), plotStats=FALSE, verbose=FALSE)
# Error in .AUCell_buildRankings(exprMat = exprMat, featureType = featureType,  : 
#   To use a dgCMatrix as input set 'splitByBlocks=TRUE'.

Is this change intended?

This kind of change should be considered carefully as it breaks existing code, like the code used in the OSCA book. See https://bioconductor.org/checkResults/3.16/books-LATEST/OSCA.basic/nebbiolo2-buildsrc.html

If this change was introduced by mistake, I would strongly suggest (again) that you improve your unit tests to cover the various situations that AUCell_buildRankings() is expected to handle.

Thanks,
H.

@s-aibar
Copy link
Member

s-aibar commented May 23, 2022

Dear @hpages ,

In the last release we have added support for sparse matrices, so that that is the current expected behaviour.

When using dense matrices (as before), the behavior should have not changed, and you can keep using the same code.

However, this was not optimal, so we have added the possibility to use sparse matrices, and run the rankings in chunks. This is more efficient and facilitates the analyses of bigger datasets or in more limited computers. Therefore, AUCell_buildRankings now requires to add the argument splitByBlocks=TRUE (see example below). Ideally, the behaviour of the fucntion would be exactly the same as with dense matrices (without requiring any difference in inputs), but I have not found the way to do so...
Sorry for the confusion and the inconvenience for the dependencies,
Sara

example:

library(Matrix)
library(AUCell)
exprMatrix <- cbind(cell1=100*4:0, cell2=c(500, 0, 90, 0, 750))
rownames(exprMatrix) <- sprintf("gene%02d", seq_len(nrow(exprMatrix)))

# With standard matrix: Same behaviour as before
AUCell_buildRankings(exprMatrix, plotStats=FALSE, verbose=FALSE)


# With sparse matrix: Neet to set the argument 'splitByBlocks'
AUCell_buildRankings(as(exprMatrix, "dgCMatrix"), plotStats=FALSE, verbose=FALSE, splitByBlocks=TRUE)

@s-aibar s-aibar closed this as completed May 23, 2022
@hpages
Copy link
Contributor Author

hpages commented May 23, 2022

Can we please be less eager to close this?

In the last release we have added support for sparse matrices, so that that is the current expected behaviour.

There's a difference between adding support for sparse matrices and improving support for sparse matrices.

As I reported above, previous versions of the package (e.g. AUCell 1.16.0) already supported dgCMatrix objects:

library(Matrix)
library(SummarizedExperiment)
library(AUCell)
packageVersion("AUCell")
# [1] ‘1.16.0’

m1 <- cbind(cell1=100*4:0, cell2=c(500, 0, 90, 0, 750))
rownames(m1) <- sprintf("gene%02d", seq_len(nrow(m1)))
m2 <- as(m1, "dgCMatrix")
m2
# 5 x 2 sparse Matrix of class "dgCMatrix"
#        cell1 cell2
# gene01   400   500
# gene02   300     .
# gene03   200    90
# gene04   100     .
# gene05     .   750

rankings <- AUCell_buildRankings(m2, plotStats=FALSE, verbose=FALSE)
assay(rankings)
#         cells
# genes    cell1 cell2
#   gene01     1     2
#   gene02     2     5
#   gene03     3     3
#   gene04     4     4
#   gene05     5     1

Note that to make this work, the package was providing an AUCell_buildRankings method for dgCMatrix objects:

> selectMethod(AUCell_buildRankings, "dgCMatrix")
Method Definition:

function (exprMat, plotStats = TRUE, nCores = 1, mctype = c("domc")[1], 
    keepZeroesAsNA = FALSE, verbose = TRUE, ...) 
{
    .local <- function (exprMat, plotStats = TRUE, nCores = 1, 
        mctype = c("domc")[1], keepZeroesAsNA = FALSE, verbose = TRUE) 
    {
        exprMat <- as.matrix(exprMat)
        .AUCell_buildRankings(exprMat = exprMat, plotStats = plotStats, 
            nCores = nCores, mctype = mctype, keepZeroesAsNA = keepZeroesAsNA, 
            verbose = verbose)
    }
    .local(exprMat, plotStats, nCores, mctype, keepZeroesAsNA, 
        verbose, ...)
}
<bytecode: 0x55de0f23cde8>
<environment: namespace:AUCell>

Signatures:
        exprMat    
target  "dgCMatrix"
defined "dgCMatrix"

How this method was handling dgCMatrix objects was admittedly not efficient, and I can see that you decided to improve that.

However, adding the splitByBlocks argument and forcing the user to set it to TRUE on dgCMatrix objects is completely unnecessary. If the function receives a dgCMatrix object, then it should just do the right thing, and there should be no need for the end user to change the way they call AUCell_buildRankings(). In other words, ordinary (a.k.a. dense) matrices and sparse matrices should be interchangeable, in a manner that is transparent to the user.

As I said, changing the interface of the function in a way that breaks analyses that were already using the function on sparse data (like the OSCA book) should not be done lightly.

The Bioconductor project has a strong commitment to reproducibility, and Bioconductor packages are expected to avoid changes that break backward compatibility, or at least to minimize their impact when those changes can't be avoided. In your case breaking backward compatibility can and should be avoided.

Please reconsider.

Thanks,
H.

@alanocallaghan
Copy link

alanocallaghan commented May 23, 2022

Furthermore if splitByBlocks should always be TRUE when the input is a dgCMatrix, then you could set this automatically in the argument list.

Something like this:

buildrankings <- function(matrix, splitByBlocks = inherits(matrix, "dgCMatrix")) {}

would set splitByBlocks to be TRUE as default iff matrix is a dgCMatrix.

@s-aibar s-aibar reopened this May 24, 2022
@s-aibar s-aibar self-assigned this May 24, 2022
@s-aibar
Copy link
Member

s-aibar commented May 24, 2022

Ok, thank you for the comments, I will look into it during the next weeks.

@hpages
Copy link
Contributor Author

hpages commented Aug 30, 2022

@s-aibar Just a reminder that this issue is still pending and has been breaking the OSCA book for several months: https://bioconductor.org/checkResults/3.15/books-LATEST/OSCA.basic/nebbiolo1-buildsrc.html Thanks!

@alanocallaghan Note that is() is preferred over inherits() for S4 classes so I would just use is(matrix, "dgCMatrix") here. It's also less typing 😉

@alanocallaghan
Copy link

Any chance this can be reviewed?

@s-aibar
Copy link
Member

s-aibar commented Oct 25, 2022

Hello,

I have updated the Development brach of the package to include your suggestions. They will be published with the next version of Bioconductor (scheduled for Nov 2).

Thank you for your feedback and suggestions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants