Torch probably needs to be less than 2.0 #70

Open
sgbaird opened this issue Jun 7, 2023 · 4 comments
@sgbaird
Member

sgbaird commented Jun 7, 2023

https://github.com/sparks-baird/matsci-opt-benchmarks/actions/runs/5193086890/jobs/9363184385

def test_matbench_metric_calculator():
        actual_param = userparam_to_crabnetparam(user_param, seed=50)
>       matbench_metric_calculator(actual_param, dummy=True)

tests/crabnet_hyperparameter_test.py:41: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/default/lib/python3.10/site-packages/matsci_opt_benchmarks/crabnet_hyperparameter/utils/parameters.py:361: in matbench_metric_calculator
    cb.fit(train_df=train_df)
.tox/default/lib/python3.10/site-packages/crabnet/crabnet_.py:409: in fit
    self._train()
.tox/default/lib/python3.10/site-packages/crabnet/crabnet_.py:628: in _train
    self.optimizer.step()
.tox/default/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:69: in wrapper
    return wrapped(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = (SWA (
Parameter Group 0
    betas: (0.89992, 0.74995)
    eps: 1e-06
    initial_lr: 0.0001
    lookahead_alpha: 0.5
    lookahead_k: 6
    lookahead_step: 0
    lr: 0.0001
    n_avg: 0
    step_counter: 0
    weight_decay: 0.0
),)
kwargs = {}
self = SWA (
Parameter Group 0
    betas: (0.89992, 0.74995)
    eps: 1e-06
    initial_lr: 0.0001
    lookahead_alpha: 0.5
    lookahead_k: 6
    lookahead_step: 0
    lr: 0.0001
    n_avg: 0
    step_counter: 0
    weight_decay: 0.0
)
_ = [], profile_name = 'Optimizer.step#SWA.step'

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        self, *_ = args
        profile_name = "Optimizer.step#{}.step".format(self.__class__.__name__)
        with torch.autograd.profiler.record_function(profile_name):
            # call optimizer step pre hooks
>           for pre_hook in chain(_global_optimizer_pre_hooks.values(), self._optimizer_step_pre_hooks.values()):
E           AttributeError: 'SWA' object has no attribute '_optimizer_step_pre_hooks'. Did you mean: '_optimizer_step_code'?

.tox/default/lib/python3.10/site-packages/torch/optim/optimizer.py:271: AttributeError
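
The failure mechanism, as far as I can tell: torch >= 2.0 wraps every Optimizer subclass's step() in a hook-dispatching wrapper (the wrapper shown in the traceback) that reads self._optimizer_step_pre_hooks, and that attribute is only created inside Optimizer.__init__. CrabNet's copied SWA wrapper never calls Optimizer.__init__, so the first step() call raises. Below is a minimal sketch that reproduces the same error outside CrabNet; WrapperLikeSWA is illustrative only, not CrabNet code.

    # Minimal sketch reproducing the same AttributeError outside CrabNet.
    # WrapperLikeSWA is illustrative only, not CrabNet code: it mimics an
    # optimizer wrapper that subclasses Optimizer but skips Optimizer.__init__,
    # so the hook dicts that the torch >= 2.0 step wrapper reads never exist.
    import torch
    from torch.optim import SGD, Optimizer

    class WrapperLikeSWA(Optimizer):
        def __init__(self, base_optimizer):
            # deliberately no super().__init__(), mirroring the copied SWA wrapper
            self.base_optimizer = base_optimizer
            self.param_groups = base_optimizer.param_groups
            self.defaults = base_optimizer.defaults
            self.state = base_optimizer.state

        def step(self, closure=None):
            return self.base_optimizer.step(closure)

    model = torch.nn.Linear(2, 1)
    opt = WrapperLikeSWA(SGD(model.parameters(), lr=0.1))
    model(torch.randn(4, 2)).sum().backward()
    opt.step()  # torch >= 2.0: AttributeError: ... no attribute '_optimizer_step_pre_hooks'
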
@sgbaird sgbaird added the bug Something isn't working label Jun 7, 2023
@sgbaird sgbaird self-assigned this Jun 7, 2023
sgbaird added a commit that referenced this issue Jun 19, 2023
@sgbaird
Member Author

sgbaird commented Jun 23, 2023

For me to address this error without forcing torch < 2.0, I will need to adjust CrabNet to use SWA from the official PyTorch implementation rather than the code that Anthony or Kaai copied (probably from a blog) when developing CrabNet.

Aside: I'm not sure how this will affect hyperparameters in the context of matsci-opt-benchmarks and crabnet-hyperparameter.

Official PyTorch SWA
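
For reference, here is a minimal sketch of what the official utilities (torch.optim.swa_utils, available since torch 1.6) look like in use, adapted from the PyTorch docs. The toy model, data, and schedule below are placeholders, not CrabNet settings.

    # Minimal SWA sketch using torch.optim.swa_utils; the toy model, data, and
    # hyperparameters are placeholders, not CrabNet settings.
    import torch
    from torch.optim.swa_utils import SWALR, AveragedModel, update_bn

    model = torch.nn.Linear(10, 1)
    loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(20)]
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    swa_model = AveragedModel(model)                # running average of the weights
    swa_scheduler = SWALR(optimizer, swa_lr=0.005)  # LR schedule used once averaging starts
    swa_start = 5                                   # epoch at which averaging begins

    for epoch in range(10):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)      # fold current weights into the average
            swa_scheduler.step()

    update_bn(loader, swa_model)                    # refresh BatchNorm stats (no-op for Linear)
    preds = swa_model(torch.randn(4, 10))           # predict with the averaged weights
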

Uncited source code within CrabNet

  • Entirety of https://github.com/sparks-baird/CrabNet/blob/main/crabnet/utils/optim.py (in particular, SWA class)
  • class Lamb(Optimizer):
        r"""Implements Lamb algorithm.

        It has been proposed in `Large Batch Optimization for Deep Learning:
        Training BERT in 76 minutes`_.

        Arguments:
            params (iterable): iterable of parameters to optimize or dicts defining
                parameter groups
            lr (float, optional): learning rate (default: 1e-3)
            betas (Tuple[float, float], optional): coefficients used for computing
                running averages of gradient and its square (default: (0.9, 0.999))
            eps (float, optional): term added to the denominator to improve
                numerical stability (default: 1e-8)
            weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
            adam (bool, optional): always use trust ratio = 1, which turns this
                into Adam. Useful for comparison purposes.

        _Large Batch Optimization for Deep Learning: Training BERT in 76
        minutes:
            https://arxiv.org/abs/1904.00962
        """

        def __init__(
            self,
            params,
            lr=1e-3,
            betas=(0.9, 0.999),
            eps=1e-6,
            weight_decay=0,
            adam=False,
            min_trust=None,
        ):
            if not 0.0 <= lr:
                raise ValueError(f"Invalid learning rate: {lr}")
            if not 0.0 <= eps:
                raise ValueError(f"Invalid epsilon value: {eps}")
            if not 0.0 <= betas[0] < 1.0:
                raise ValueError(f"Invalid beta parameter at index 0: {betas[0]}")
            if not 0.0 <= betas[1] < 1.0:
                raise ValueError(f"Invalid beta parameter at index 1: {betas[1]}")
            if min_trust and not 0.0 <= min_trust < 1.0:
                raise ValueError(f"Minimum trust range from 0 to 1: {min_trust}")
            defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
            self.adam = adam
            self.min_trust = min_trust
            super(Lamb, self).__init__(params, defaults)

        def step(self, closure=None):
            """Performs a single optimization step.

            Arguments:
                closure (callable, optional): A closure that reevaluates the model
                    and returns the loss.
            """
            loss = None
            if closure is not None:
                loss = closure()

            for group in self.param_groups:
                for p in group["params"]:
                    if p.grad is None:
                        continue

                    grad = p.grad.data
                    if grad.is_sparse:
                        err_msg = (
                            "Lamb does not support sparse gradients, "
                            + "consider SparseAdam instad."
                        )
                        raise RuntimeError(err_msg)

                    state = self.state[p]

                    # State initialization
                    if len(state) == 0:
                        state["step"] = 0
                        # Exponential moving average of gradient values
                        state["exp_avg"] = torch.zeros_like(p.data)
                        # Exponential moving average of squared gradient values
                        state["exp_avg_sq"] = torch.zeros_like(p.data)

                    exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                    beta1, beta2 = group["betas"]

                    state["step"] += 1

                    # Decay the first and second moment running average coefficient
                    # m_t
                    exp_avg.mul_(beta1).add_((1 - beta1) * grad)
                    # v_t
                    # exp_avg_sq.mul_(beta2).addcmul_((1 - beta2) * grad *
                    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=(1 - beta2))

                    # Paper v3 does not use debiasing.
                    # bias_correction1 = 1 - beta1 ** state['step']
                    # bias_correction2 = 1 - beta2 ** state['step']
                    # Apply bias to lr to avoid broadcast.
                    step_size = group[
                        "lr"
                    ]  # * math.sqrt(bias_correction2) / bias_correction1

                    weight_norm = p.data.pow(2).sum().sqrt().clamp(0, 10)

                    adam_step = exp_avg / exp_avg_sq.sqrt().add(group["eps"])
                    if group["weight_decay"] != 0:
                        adam_step.add_(group["weight_decay"], p.data)

                    adam_norm = adam_step.pow(2).sum().sqrt()
                    if weight_norm == 0 or adam_norm == 0:
                        trust_ratio = 1
                    else:
                        trust_ratio = weight_norm / adam_norm
                    if self.min_trust:
                        trust_ratio = max(trust_ratio, self.min_trust)
                    state["weight_norm"] = weight_norm
                    state["adam_norm"] = adam_norm
                    state["trust_ratio"] = trust_ratio
                    if self.adam:
                        trust_ratio = 1

                    p.data.add_(-step_size * trust_ratio * adam_step)

            return loss
  • class Lookahead(Optimizer):
        def __init__(self, base_optimizer, alpha=0.5, k=6):
            if not 0.0 <= alpha <= 1.0:
                raise ValueError(f"Invalid slow update rate: {alpha}")
            if not 1 <= k:
                raise ValueError(f"Invalid lookahead steps: {k}")
            defaults = dict(lookahead_alpha=alpha, lookahead_k=k, lookahead_step=0)
            self.base_optimizer = base_optimizer
            self.param_groups = self.base_optimizer.param_groups
            self.defaults = base_optimizer.defaults
            self.defaults.update(defaults)
            self.state = defaultdict(dict)
            # manually add our defaults to the param groups
            for name, default in defaults.items():
                for group in self.param_groups:
                    group.setdefault(name, default)

        def update_slow(self, group):
            for fast_p in group["params"]:
                if fast_p.grad is None:
                    continue
                param_state = self.state[fast_p]
                if "slow_buffer" not in param_state:
                    param_state["slow_buffer"] = torch.empty_like(fast_p.data)
                    param_state["slow_buffer"].copy_(fast_p.data)
                slow = param_state["slow_buffer"]
                slow.add_(group["lookahead_alpha"] * (fast_p.data - slow))
                fast_p.data.copy_(slow)

        def sync_lookahead(self):
            for group in self.param_groups:
                self.update_slow(group)

        def step(self, closure=None):
            # assert id(self.param_groups) == id(self.base_optimizer.param_groups)
            loss = self.base_optimizer.step(closure)
            for group in self.param_groups:
                group["lookahead_step"] += 1
                if group["lookahead_step"] % group["lookahead_k"] == 0:
                    self.update_slow(group)
            return loss

        def state_dict(self):
            fast_state_dict = self.base_optimizer.state_dict()
            slow_state = {
                (id(k) if isinstance(k, torch.Tensor) else k): v
                for k, v in self.state.items()
            }
            fast_state = fast_state_dict["state"]
            param_groups = fast_state_dict["param_groups"]
            return {
                "state": fast_state,
                "slow_state": slow_state,
                "param_groups": param_groups,
            }

        def load_state_dict(self, state_dict):
            fast_state_dict = {
                "state": state_dict["state"],
                "param_groups": state_dict["param_groups"],
            }
            self.base_optimizer.load_state_dict(fast_state_dict)

            # We want to restore the slow state, but share param_groups reference
            # with base_optimizer. This is a bit redundant but least code
            slow_state_new = False
            if "slow_state" not in state_dict:
                print("Loading state_dict from optimizer without Lookahead applied.")
                state_dict["slow_state"] = defaultdict(dict)
                slow_state_new = True
            slow_state_dict = {
                "state": state_dict["slow_state"],
                "param_groups": state_dict[
                    "param_groups"
                ],  # this is pointless but saves code
            }
            super(Lookahead, self).load_state_dict(slow_state_dict)
            self.param_groups = (
                self.base_optimizer.param_groups
            )  # make both ref same container
            if slow_state_new:
                # reapply defaults to catch missing lookahead specific ones
                for name, default in self.defaults.items():
                    for group in self.param_groups:
                        group.setdefault(name, default)

Maybe Lamb is from https://github.com/cybertronai/pytorch-lamb/blob/master/pytorch_lamb/lamb.py?

Maybe Lookahead is from https://github.com/lonePatient/lookahead_pytorch/blob/master/optimizer.py?

Places that require refactoring

... or at least a double check. This might not be comprehensive, but it's a rough idea. A sketch of how the SWA-wrapper bookkeeping could map onto torch.optim.swa_utils follows the list.

  • CrabNet/crabnet/crabnet_.py

    Lines 372 to 391 in b9835dc

    base_optim = Lamb(
        params=self.model.parameters(),
        lr=self.lr,
        betas=self.betas,
        eps=self.eps,
        weight_decay=self.weight_decay,
        adam=self.adam,
        min_trust=self.min_trust,
    )
    optimizer = Lookahead(base_optimizer=base_optim, alpha=self.alpha, k=self.k)
    self.optimizer = SWA(optimizer)

    lr_scheduler = CyclicLR(
        self.optimizer,
        base_lr=self.base_lr,
        max_lr=self.max_lr,
        cycle_momentum=False,
        step_size_up=self.stepsize,
    )
    self.lr_scheduler = lr_scheduler
  • CrabNet/crabnet/crabnet_.py

    Lines 395 to 401 in b9835dc

    self.stepping = True
    self.swa_start = 2 # start at (n/2) cycle (lr minimum)
    self.xswa: List[int] = []
    self.yswa: List[float] = []
    self.lr_list: List[float] = []
    self.discard_n = 3
  • self.lr_list.append(self.optimizer.param_groups[0]["lr"])
  • CrabNet/crabnet/crabnet_.py

    Lines 423 to 432 in b9835dc

    if self.optimizer.discard_count >= self.discard_n:
        if self.verbose:
            print(
                f"Discarded: {self.optimizer.discard_count}/"
                f"{self.discard_n} weight updates, early-stopping now"
            )
        self.optimizer.swap_swa_sgd()
        break

    if not (self.optimizer.discard_count >= self.discard_n):
        self.optimizer.swap_swa_sgd()
  • CrabNet/crabnet/crabnet_.py

    Lines 627 to 633 in b9835dc

    self.optimizer.step()
    self.optimizer.zero_grad()
    if self.stepping:
        self.lr_scheduler.step()

    # hyperparameter updates
    swa_check = self.epochs_step * self.swa_start - 1
  • CrabNet/crabnet/crabnet_.py

    Lines 651 to 661 in b9835dc

    self.optimizer.update_swa(mae_v)
    minima.append(self.optimizer.minimum_found)

    if learning_time and not any(minima):
        self.optimizer.discard_count += 1
        if self.verbose:
            print(f"Epoch {self.epoch} failed to improve.")
            print(
                f"Discarded: {self.optimizer.discard_count}/"
                f"{self.discard_n} weight updates"
            )
  • base_optim = Lamb(params=self.model.parameters())
  • if epoch == epochs - 1 or self.optimizer.discard_count >= self.discard_n:
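
For concreteness, here is a rough sketch of how the SWA-wrapper bookkeeping above (update_swa(mae_v), minimum_found, discard_count, swap_swa_sgd) could map onto torch.optim.swa_utils.AveragedModel. This is not an implemented fix: the toy model, data, thresholds, and variable names are placeholders. Note also that the copied Lookahead class skips Optimizer.__init__ just like SWA does, so it would need the same kind of repair (or a super().__init__ call) to run under torch >= 2.0.

    # Sketch only: placeholder model/data; maps the quoted bookkeeping onto AveragedModel.
    import torch
    from torch.optim.swa_utils import AveragedModel

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # stand-in for Lamb/Lookahead
    loss_fn = torch.nn.L1Loss()
    train = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(10)]
    val_x, val_y = torch.randn(32, 10), torch.randn(32, 1)

    swa_model = AveragedModel(model)
    best_mae = float("inf")
    discard_count, discard_n = 0, 3          # mirrors self.discard_n = 3 above

    for epoch in range(20):
        for x, y in train:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        with torch.no_grad():
            mae_v = loss_fn(model(val_x), val_y).item()
        if mae_v < best_mae:                 # roughly what update_swa(mae_v)/minimum_found tracked
            best_mae = mae_v
            swa_model.update_parameters(model)
        else:
            discard_count += 1               # epoch failed to improve
        if discard_count >= discard_n:
            break                            # early stopping, as in the quoted lines 423-432

    # roughly what swap_swa_sgd() did: adopt the averaged weights at the end
    model.load_state_dict(swa_model.module.state_dict())
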

cc @anthony-wang @Kaaiian

sgbaird added a commit to sparks-baird/mat_discover that referenced this issue Jun 23, 2023
@cseeg

cseeg commented Jul 27, 2023

What version of pytorch was usually used?

@sgbaird
Member Author

sgbaird commented Jul 27, 2023

Probably just the latest release before 2.0 (i.e., a 1.13.x version). You shouldn't have to do anything special when installing locally on your computer, since the version constraint is already written into the latest requirements.

@sgbaird
Member Author

sgbaird commented Nov 1, 2023

From #73 (reply in thread) by @DavidSiretMarques

I'm using pytorch for other things too, so that isn't an option...

That will make things tough. While not ideal, have you tried forcing installation of PyTorch < 2.0 and seeing if that breaks things in your other code?

Do you know the cause of the problem?

I've dedicated some time to this, but only enough to identify the places that require refactoring and some potential solutions.

It was weird that it worked on my old Mac computer and didn't on my newer Windows computer... they should have the exact same env with the same libraries...

That's pretty interesting. I'm not sure why that would happen, though.
