Commit fd8aff4

Bugfix for exponential-lognormal mixture model: remove weight rounding during EM iterations (#50)

* Remove weight rounding in M-step of ELMM
* Remove several roundings for printed parameters
* Remove rounding for printed parameters in LMM
* Test pipeline: pin a Nextflow version compatible with DSL1
* Default to 600 EM iterations; add missing max_iter in LMM
Cecilia-Sensalari authored Jun 1, 2023
1 parent 0d049d4 commit fd8aff4
Showing 10 changed files with 40 additions and 44 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/test_pipeline.yml
@@ -37,7 +37,7 @@ jobs:
 run: |
   cd test
   mv nextflow.config custom_nextflow.config
-  nextflow run ../main.nf --config config_elaeis.txt -with-docker ksrates
+  NXF_VER=21.10.6 nextflow run ../main.nf --config config_elaeis.txt -with-docker ksrates
- name: Visualize output files
if: ${{ always() }}
4 changes: 2 additions & 2 deletions doc/source/configuration.rst
@@ -221,7 +221,7 @@ The following can be used as a template::
     kde_bandwidth_modifier = 0.4
     plot_adjustment_arrows = no
     num_mixture_model_initializations = 10
-    max_mixture_model_iterations = 300
+    max_mixture_model_iterations = 600
     max_mixture_model_components = 5
     max_mixture_model_ks = 5
     extra_paralogs_analyses_methods = no
@@ -232,7 +232,7 @@ The following can be used as a template::
* **kde_bandwidth_modifier**: modifier to adjust the fitting of the KDE curve on the underlying whole-paranome or anchor *K*:sub:`S` distribution. The KDE Scott's factor internally computed by SciPy tends to produce an overly smooth KDE curve, especially with steep WGD peaks, and therefore it is reduced by multiplying it by a modifier. Decreasing the modifier leads to tighter fits, increasing it leads to smoother fits, and setting it to 1 gives the default KDE factor. Note that a too small factor is likely to take into account data noise. [Default: 0.4]
* **plot_adjustment_arrows**: flag to toggle the plotting of rate-adjustment arrows below the adjusted mixed paralog--ortholog *K*:sub:`S` plot. These arrows start from the original unadjusted ortholog divergence *K*:sub:`S` estimate and end on the rate-adjusted estimate (options: "yes" and "no"). [Default: "no"]
* **num_mixture_model_initializations**: number of times the EM algorithm is initialized (either for the random initialization in the exponential-lognormal mixture model or for k-means in the lognormal mixture model). [Default: 10]
-* **max_mixture_model_iterations**: maximum number of EM iterations for mixture modeling. [Default: 300]
+* **max_mixture_model_iterations**: maximum number of EM iterations for mixture modeling. [Default: 600]
* **max_mixture_model_components**: maximum number of components considered during execution of the mixture models. [Default: 5]
* **max_mixture_model_ks**: upper limit for the *K*:sub:`S` range in which the exponential-lognormal and lognormal-only mixture models are performed. [Default: 5]
* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution with non default mixture model methods (see section :ref:`paralogs_analyses` and Supplementary Materials) [Default: "no"]
4 changes: 2 additions & 2 deletions doc/source/paralogs_analyses.rst
@@ -71,7 +71,7 @@ Among its outputs (for a complete list see section :ref:`output_files`), the ELM
* The model initialization approach is stored in column 1 ``Model`` according to a numerical code (1: data-driven, 2: hybrid data-driven plus a random lognormal component, 3: random initialization with exponential component and one lognormal component, 4: random initialization with exponential component and two lognormal components; higher numbers feature random initialization with exponential component and increasing number of lognormal components).
* The initialization round is stored in column 2 ``Initialization``. By default each model type (except type 1) is initialized and fitted 10 times, so this column shows numbers from 1 to 10.
* The BIC and loglikelihood scores for the fitted model are stored in columns 3 ``BIC`` and 4 ``Loglikelihood``.
-* The number of algorithm iterations needed to reach convergence is stored in column 5 ``Convergence``. If greater than 300 iterations would be needed, convergence is not reached and the cell will show *NA*.
+* The number of algorithm iterations needed to reach convergence is stored in column 5 ``Convergence``. If greater than (by default) 600 iterations would be needed, convergence is not reached and the cell will show *NA*.
* The fitted exponential rate parameter and its component weight are stored in columns 6 ``Exponential_Rate`` and 7 ``Exponential_Weight``.
* The mean, standard deviation and weight of the fitted Normal components used to define the correspondent lognormal components are stored in columns 8 to 10: ``Normal_Mean``, ``Normal_SD`` and ``Normal_Weight``. When there are multiple lognormal components, the data for each of them are stored in a separate row (the number of rows for each model and initialization is thus equal to the number of lognormal components).
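The per-row layout described above lends itself to simple post-processing, e.g. picking the best-supported fit by lowest BIC while skipping non-converged rows. A minimal stdlib sketch; the column names are those documented here, but the sample values and the inline TSV are hypothetical:

```python
import csv
import io

# Hypothetical excerpt of an ELMM parameter table with the documented columns.
# Lower BIC indicates a better-supported model; "NA" in Convergence marks a
# fit that did not converge within the iteration cap and should be skipped.
tsv = """Model\tInitialization\tBIC\tLoglikelihood\tConvergence
3\t1\t1520.4\t-750.1\t412
3\t2\t1498.7\t-739.3\t288
4\t1\t1505.2\t-741.8\tNA
"""

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
converged = [r for r in rows if r["Convergence"] != "NA"]
best = min(converged, key=lambda r: float(r["BIC"]))
print(best["Model"], best["Initialization"])  # → 3 2
```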

@@ -106,7 +106,7 @@ Among its outputs (for a complete list see section :ref:`output_files`), the LMM

* The model type is stored in column 1 ``Model`` according to a numerical code (1: one lognormal component, 2: two lognormal components, 3: three lognormal components; and so on).
* The BIC and loglikelihood scores for the fitted model are stored in columns 2 ``BIC`` and 3 ``Loglikelihood``
-* The number of algorithm iterations needed to reach convergence is stored in column 4 ``Convergence``. If greater than 300 iterations would be needed, convergence is not reached and the cell will show *NA*.
+* The number of algorithm iterations needed to reach convergence is stored in column 4 ``Convergence``. If greater than (by default) 600 iterations would be needed, convergence is not reached and the cell will show *NA*.
* The mean, standard deviation and weight of the fitted Normal components used to define the correspondent lognormal components are stored in columns 5 to 7: ``Normal_Mean``, ``Normal_SD`` and ``Normal_Weight``. When there are multiple lognormal components, the data for each of them are stored in a separate row (the number of rows per model is thus equal to the number of components).


2 changes: 1 addition & 1 deletion example/config_expert.txt
@@ -3,7 +3,7 @@
logging_level = info
kde_bandwidth_modifier = 0.4
plot_adjustment_arrows = no
-max_mixture_model_iterations = 300
+max_mixture_model_iterations = 600
num_mixture_model_initializations = 10
extra_paralogs_analyses_methods = no
max_mixture_model_components = 5
2 changes: 1 addition & 1 deletion ksrates/cluster_anchor_ks.py
@@ -37,7 +37,7 @@ def cluster_anchor_ks(config_file, expert_config_file, correction_table_file, pa
color_list = config.get_color_list()
plot_correction_arrows = config.plot_correction_arrows()

-    max_EM_iterations = config.get_max_EM_iterations() # default 300
+    max_EM_iterations = config.get_max_EM_iterations() # default 600
num_EM_initializations = config.get_num_EM_initializations() # how many times the fitting with N given components is initialized
logging.info(f" - maximum EM iterations: {max_EM_iterations}")
logging.info(f" - number of EM initializations: {num_EM_initializations}")
2 changes: 1 addition & 1 deletion ksrates/exp_log_mixture.py
@@ -44,7 +44,7 @@ def exp_log_mixture(config_file, expert_config_file, paralog_tsv_file, correctio

# Parameters used during the mixture modeling
max_ks_EM = config.get_max_ks_for_mixture_model(max_ks_para) # upper Ks limit considered for the mixture model fitting
-    max_EM_iterations = config.get_max_EM_iterations() # default 300
+    max_EM_iterations = config.get_max_EM_iterations() # default 600
num_EM_initializations = config.get_num_EM_initializations() # how many times the fitting with N given components is initialized
# in the random method and in the "peak + random" method). default 10
max_num_comp = config.get_max_mixture_model_components() # max number of components used in the fitting with random components (exp + buffer lognormal + lognormal)
20 changes: 10 additions & 10 deletions ksrates/fc_configfile.py
@@ -607,7 +607,7 @@ def get_kde_bandwidth_modifier(self):

 def get_max_EM_iterations(self):
     """
-    Gets the maximum number of EM iterations to apply during mixture modeling [Default: 300].
+    Gets the maximum number of EM iterations to apply during mixture modeling [Default: 600].
:return max_mixture_model_iterations: max number of iterations for the exponential-maximization
algorithm
@@ -620,20 +620,20 @@ def get_max_EM_iterations(self):
     except Exception:
         pass
     if not isinstance(max_iter, int):
-        logging.warning(f'Unrecognized field in expert configuration file [max_mixture_model_iterations = {max_iter}]. Please choose a positive integer. Default choice will be applied [300]')
-        max_iter = 300
+        logging.warning(f'Unrecognized field in expert configuration file [max_mixture_model_iterations = {max_iter}]. Please choose a positive integer. Default choice will be applied [600]')
+        max_iter = 600
     elif max_iter <= 0:
-        logging.warning(f'Unrecognized field in expert configuration file [max_mixture_model_iterations = {max_iter}]. Please choose a positive integer. Default choice will be applied [300]')
-        max_iter = 300
-    elif max_iter < 100:
+        logging.warning(f'Unrecognized field in expert configuration file [max_mixture_model_iterations = {max_iter}]. Please choose a positive integer. Default choice will be applied [600]')
+        max_iter = 600
+    elif max_iter <= 300:
         logging.warning(f"A small maximum number of mixture model iterations [max_mixture_model_iterations = {max_iter}] can produce poor fitting.")
-    elif max_iter > 400:
+    elif max_iter > 600:
         logging.warning(f"A large maximum number of mixture model iterations [max_mixture_model_iterations = {max_iter}] can increase the runtime.")
 except Exception:
-    logging.warning(f'Missing field in expert configuration file [max_mixture_model_iterations]. Please choose a positive integer. Default choice will be applied [300]')
-    max_iter = 300
+    logging.warning(f'Missing field in expert configuration file [max_mixture_model_iterations]. Please choose a positive integer. Default choice will be applied [600]')
+    max_iter = 600
 else:
-    max_iter = 300
+    max_iter = 600
 return max_iter
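The checks above can be distilled into a small standalone function; this is a sketch of the intent, not the project's actual `get_max_EM_iterations` API, and the function name and messages are illustrative:

```python
import logging

def validate_max_iter(value, default=600):
    """Coerce a configured iteration cap to a positive int, falling back to the default."""
    try:
        max_iter = int(value)
    except (TypeError, ValueError):
        logging.warning("max_mixture_model_iterations is not an integer; using %d", default)
        return default
    if max_iter <= 0:
        logging.warning("max_mixture_model_iterations must be positive; using %d", default)
        return default
    if max_iter <= 300:
        logging.warning("a small iteration cap can produce poor fitting")
    elif max_iter > 600:
        logging.warning("a large iteration cap can increase the runtime")
    return max_iter

print(validate_max_iter("450"))  # → 450
print(validate_max_iter(-5))     # → 600
print(validate_max_iter(None))   # → 600
```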


24 changes: 10 additions & 14 deletions ksrates/fc_exp_log_mixture.py
@@ -588,7 +588,7 @@ def m_step(num_comp, data, posteriors):
     new_weights = []
     for k in range(0, num_comp):
         weight = points_per_k[k] / len(data)
-        new_weights.append(round(weight, 2))
+        new_weights.append(weight)

     # Computes updated means and stdevs of the lognormal components
     new_means, new_stdevs = [], []
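The removed `round(weight, 2)` is the heart of this bugfix: mixture weights must sum to 1 and feed the next E-step, so rounding them every M-step both violates that constraint and collapses small components to zero. A minimal illustration of the corrected update (the posterior responsibilities below are made-up numbers, not project data):

```python
# M-step weight update: each component's weight is its share of the soft
# posterior mass, kept at full precision so the weights still sum to 1.
posteriors = [
    [0.996, 0.002, 0.002],  # responsibilities of 3 components for data point 1
    [0.994, 0.003, 0.003],
    [0.990, 0.005, 0.005],
]
n = len(posteriors)
new_weights = [sum(p[k] for p in posteriors) / n for k in range(3)]
rounded = [round(w, 2) for w in new_weights]

print(new_weights)  # full-precision weights; they sum to 1
print(rounded)      # → [0.99, 0.0, 0.0]: rounding zeroes out the minor components
```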
@@ -607,7 +607,7 @@ def em(num_comp, max_iter, ks_data_proxy, init_lambd, init_means, init_stdevs, i
     """
     Performs the EM algorithm with a given number of components, their parameters on the proxy dataset
     for the weighted Ks paranome (the deconvoluted data). If convergence of the algorithm based on
-    threshold of 1e-6 is not reached in 300 iterations, it prints a warning.
+    threshold of 1e-6 is not reached in 600 iterations, it prints a warning.
:param num_comp: number of components (1 exponential distribution + N lognormal components)
:param max_iter: maximum number of iterations for the EM steps
@@ -713,12 +713,12 @@ def print_details_model(model_id, model_iteration, max_model_iteration, max_num_
     if convergence_flag:
         outfile.write(f"The EM algorithm has reached convergence after {convergence_iteration} iterations\n")
     else:
-        outfile.write(f"The EM algorithm didn't reach convergence after {max_em_iteration} iterations (diff is ~{round(loglik_diff, 2)})\n")
+        outfile.write(f"The EM algorithm didn't reach convergence after {max_em_iteration} iterations (diff is {loglik_diff})\n")
     outfile.write("\n")

-    print_parameters("fitted", round(fitted_means, 2), round(fitted_stdevs, 2), round(fitted_lambd, 2), round(fitted_weights, 2), outfile)
-    outfile.write(f"Log-likelihood: {round(final_loglik, 3)}\n")
-    outfile.write(f"BIC: {round(bic, 3)}\n")
+    print_parameters("fitted", fitted_means, fitted_stdevs, fitted_lambd, fitted_weights, outfile)
+    outfile.write(f"Log-likelihood: {final_loglik}\n")
+    outfile.write(f"BIC: {bic}\n")

# if EM_data_random or EM_random:
if (model_id == 2 or model_id == 5) and model_iteration == max_model_iteration:
@@ -736,8 +736,8 @@ def print_details_model(model_id, model_iteration, max_model_iteration, max_num_
     else:
         model_iteration = int(model_iteration)
     for mean, stdev, weight in zip(fitted_means, fitted_stdevs, fitted_weights[1:]):
-        parameter_table.append([model_id, model_iteration, round(bic, 3), round(final_loglik, 3), convergence_iteration,
-                                round(fitted_lambd, 2), fitted_weights[0], round(mean, 2), round(stdev, 2), weight])
+        parameter_table.append([model_id, model_iteration, bic, final_loglik, convergence_iteration,
+                                fitted_lambd, fitted_weights[0], mean, stdev, weight])


def make_parameter_table_file(parameter_table, species):
@@ -773,11 +773,7 @@ def print_parameters(starting_or_fitted, means, stdevs, lambd, weights, outfile)
     outfile.write(f" EXP : {lambd}\n")
     for n in range(len(means)):
         outfile.write(f" NORM {n+1}: {means[n]} +- {stdevs[n]}\n")
-    rounded_weights = []
-    for w in weights:
-        w = round(w, 2)
-        rounded_weights.append(w)
-    outfile.write(f" WEIGHT: {rounded_weights}\n")
+    outfile.write(f" WEIGHT: {weights}\n")


def plot_init_comp(ax_ks, ax_logks, means, stdevs, lambd, weights, plot_logtranformed=True):
@@ -986,7 +982,7 @@ def eval_best_model(bic_dict, outfile):
     if delta_bic > 2: j = 1
     if delta_bic > 6: j = 2
     if delta_bic > 10: j = 3
-    outfile.write(f" Model {m}: delta(BIC) = {round(delta_bic, 2):>8} ({l[j]})\n")
+    outfile.write(f" Model {m}: delta(BIC) = {delta_bic:>8} ({l[j]})\n")
outfile.write("\n")
return best_model_id
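The cutoffs in `eval_best_model` (2/6/10) match the conventional delta-BIC evidence grades. A sketch of that interpretation; the label strings and the per-model BIC values are illustrative, since the source's `l` list is not shown in this diff:

```python
def bic_support(delta_bic):
    """Map a BIC difference vs. the best model to the usual evidence grade."""
    labels = ["no significant", "positive", "strong", "very strong"]
    j = 0
    if delta_bic > 2: j = 1
    if delta_bic > 6: j = 2
    if delta_bic > 10: j = 3
    return labels[j]

bics = {1: 1530.2, 2: 1498.7, 3: 1501.0}  # hypothetical per-model BIC scores
best = min(bics, key=bics.get)            # model with the lowest BIC wins
for model, bic in sorted(bics.items()):
    delta = bic - bics[best]
    print(f" Model {model}: delta(BIC) = {delta:8.2f} ({bic_support(delta)})")
```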

22 changes: 11 additions & 11 deletions ksrates/fc_lognormal_mixture.py
@@ -91,7 +91,7 @@ def inspect_bic(bic, outfile):
outfile.write("\n")


-def log_components(X, model_id, m, outfile, parameter_table, max_iter=300):
+def log_components(X, model_id, m, outfile, parameter_table, max_iter):
"""
Modified from wgd.
@@ -112,20 +112,20 @@ def log_components(X, model_id, m, outfile, parameter_table, max_iter=300):

     outfile.write("FITTED PARAMETERS:\n")
     for j in range(len(m.means_)):
-        outfile.write(f" NORM {j+1}: {round(m.means_[j][0], 2)} +- {round(sqrt(m.covariances_[j][0][0]), 2)}\n")
+        outfile.write(f" NORM {j+1}: {m.means_[j][0]} +- {sqrt(m.covariances_[j][0][0])}\n")

-        parameter_table.append([model_id, round(m.bic(X), 3), round(m.lower_bound_, 3), convergence,
-                                round(m.means_[j][0], 2), round(m.covariances_[j][0][0], 2), round(m.weights_[j], 2)])
+        parameter_table.append([model_id, m.bic(X), m.lower_bound_, convergence,
+                                m.means_[j][0], m.covariances_[j][0][0], m.weights_[j]])

-    rounded_weights = []
+    weight_list = []
     for w in m.weights_:
-        rounded_weights.append(round(w, 2))
-    outfile.write(f" WEIGHT: {rounded_weights}\n")
-    outfile.write(f"Log-likelihood: {round(m.lower_bound_, 3)}\n")
-    outfile.write(f"BIC: {round(m.bic(X), 3)}\n\n")
+        weight_list.append(w)
+    outfile.write(f" WEIGHT: {weight_list}\n")
+    outfile.write(f"Log-likelihood: {m.lower_bound_}\n")
+    outfile.write(f"BIC: {m.bic(X)}\n\n")


-def fit_gmm(X, n1, n2, outfile, parameter_table, max_iter=300, n_init=1, **kwargs):
+def fit_gmm(X, n1, n2, outfile, parameter_table, max_iter=600, n_init=1, **kwargs):
"""
Modified from wgd.
Compute Gaussian mixtures for different numbers of components
@@ -150,7 +150,7 @@ def fit_gmm(X, n1, n2, outfile, parameter_table, max_iter=600, n_init=1, **kwarg
         n_components=N[i], covariance_type='full', max_iter=max_iter,
         n_init=n_init, tol=1e-6, **kwargs
     ).fit(X)
-    log_components(X, i+1, models[i], outfile, parameter_table)
+    log_components(X, i+1, models[i], outfile, parameter_table, max_iter)
else:
logging.warning(f"Lognormal mixture model with {N[i]} or more components is skipped due to too few input Ks data points")
break
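As the `Normal_Mean`/`Normal_SD` columns documented above indicate, both mixture models represent lognormal components as Normal components fitted to log-transformed Ks values. A minimal single-component, stdlib-only illustration of that log-transform trick; the sample data are synthetic, not ksrates output:

```python
import math
import random

random.seed(0)
# Synthetic "Ks" values drawn from a lognormal with log-mean -0.5, log-sd 0.3.
ks = [random.lognormvariate(-0.5, 0.3) for _ in range(10_000)]

# Fitting a Normal to log(Ks) recovers the lognormal's parameters.
logs = [math.log(x) for x in ks]
mu = sum(logs) / len(logs)
sd = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
print(round(mu, 2), round(sd, 2))  # close to -0.5 and 0.3
```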
2 changes: 1 addition & 1 deletion ksrates/lognormal_mixture.py
@@ -41,7 +41,7 @@ def lognormal_mixture(config_file, expert_config_file, paralog_tsv_file, anchors

# Parameters used during the mixture modeling
max_ks_EM = config.get_max_ks_for_mixture_model(max_ks_para) # upper Ks limit considered for the mixture model fitting
-    max_EM_iterations = config.get_max_EM_iterations() # default 300
+    max_EM_iterations = config.get_max_EM_iterations() # default 600
num_EM_initializations = config.get_num_EM_initializations() # how many time k-means is performed. Default 10.
max_num_comp = config.get_max_mixture_model_components()

