Issues: microsoft/DeepSpeed
- run evaluate_cogagent_demo.py error: "from sat import mpu, get_args, get_tokenizer" fails
  bug (Something isn't working) · inference · #5803 · opened Jul 27, 2024 by gyjlll
- How to silence warnings when importing deepspeed without changing the source code?
  enhancement (New feature or request) · #5801 · opened Jul 25, 2024 by sean-wade
- [BUG] RuntimeError: disagreement between rank0 and rank1: rank0:
  bug · compression · #5799 · opened Jul 24, 2024 by yiyepiaoling0715
- [BUG] Expert parallel hangs at the last MoE layer
  bug · training · #5794 · opened Jul 23, 2024 by JessePrince
- [BUG] Excessive CPU and GPU memory usage with multi-GPU inference using DeepSpeed
  bug · inference · #5793 · opened Jul 23, 2024 by gawain000000
- [BUG] Pipeline engine's training gets stuck when zero=1
  bug · compression · #5792 · opened Jul 23, 2024 by janelu9
- [REQUEST] How to set Ulysses in the DeepSpeed config JSON?
  enhancement · #5787 · opened Jul 22, 2024 by xs1997zju
- Does DeepSpeed support pure bf16 training?
  enhancement · #5784 · opened Jul 21, 2024 by hjc3613
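On the pure-bf16 question (#5784): the DeepSpeed config does expose a bf16 section that selects bf16 as the mixed-precision dtype in place of fp16. A minimal sketch of such a config, built as a Python dict; the micro-batch size is an arbitrary placeholder, and whether this counts as "pure" bf16 (no fp32 master weights) for a given setup is exactly what the issue asks:

```python
import json

# Minimal sketch of a DeepSpeed config enabling bf16 instead of fp16.
# The micro-batch size is a placeholder; the dict (or a JSON file with the
# same contents) would be passed to deepspeed.initialize(...).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},    # bf16 and fp16 cannot both be enabled
    "fp16": {"enabled": False},
}

print(json.dumps(ds_config, indent=2))
```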
- How to set "training_step" during training?
  bug · training · #5779 · opened Jul 17, 2024 by qwerfdsadad
- [BUG] Universal checkpoint conversion - "Cannot find layer_01* files in there"
  bug · training · #5776 · opened Jul 17, 2024 by exnx
- [BUG] Multi-node fine-tuning with Thunderbolt
  bug · training · #5766 · opened Jul 11, 2024 by Raywang0211
- [BUG] Multi-GPU training gets stuck when the computation graph is not complete for each process
  bug · training · #5762 · opened Jul 10, 2024 by gary-young
- [BUG] I can't run fp8 with pipeline parallel
  bug · training · #5760 · opened Jul 10, 2024 by exnx
- [BUG] Learning rate scheduler and optimizer logical issue
  bug · training · #5731 · opened Jul 5, 2024 by zhourunlong
- [BUG] lr scheduler defined in config cannot be overwritten by lr scheduler defined in code and passed to deepspeed.initialize
  bug · training · #5726 · opened Jul 5, 2024 by xiyang-aads-lilly
- [BUG] ImportError: /home/nlp/.cache/torch_extensions/py310_cu121/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
  bug · training · #5723 · opened Jul 4, 2024 by PhysicianHOYA
- [REQUEST] Asynchronous Checkpointing
  enhancement · #5721 · opened Jul 2, 2024 by zaptrem
- Issue with LoRA tuning on llama3-70b using PEFT and TRL's SFTTrainer
  training · #5719 · opened Jul 2, 2024 by yutanozaki1
- [BUG] Different seeds give the exact same loss on ZeRO 1, 2, and 3 during multi-GPU training
  bug · training · #5717 · opened Jul 2, 2024 by selenerkan
- [REQUEST] Does Universal Checkpoint support MoE checkpoints?
  enhancement · #5716 · opened Jul 2, 2024 by tiggerwu
- [BUG] "localhost: Permission denied, please try again." with a single node and multiple GPUs under --autotuning run
  bug · training · #5709 · opened Jul 1, 2024 by Looong01
- [BUG] 1-bit LAMB not compatible with bf16
  bug · training · #5708 · opened Jun 28, 2024 by catid
- on Activation Checkpointing
  bug · training · #5704 · opened Jun 28, 2024 by ChaunceyWang
- [BUG] Mixed-precision: fp16 will cast input_ids into torch.cuda.HalfTensor instead of Long or Int
  #5701 · opened Jun 28, 2024 by zhaoyang02