Issues: microsoft/DeepSpeed
- run evaluate_cogagent_demo.py error: "from sat import mpu, get_args, get_tokenizer" fails
  bug (Something isn't working) · inference · #5803 · opened Jul 27, 2024 by gyjlll
- How to silence warnings when importing deepspeed without changing the source code?
  enhancement (New feature or request) · #5801 · opened Jul 25, 2024 by sean-wade
- [BUG] RuntimeError: disagreement between rank0 and rank1: rank0:
  bug · compression · #5799 · opened Jul 24, 2024 by yiyepiaoling0715
- [BUG] Expert parallel hangs at the last MoE layer
  bug · training · #5794 · opened Jul 23, 2024 by JessePrince
- [BUG] Excessive CPU and GPU memory usage with multi-GPU inference using DeepSpeed
  bug · inference · #5793 · opened Jul 23, 2024 by gawain000000
- [BUG] Pipeline engine's training gets stuck when zero=1
  bug · compression · #5792 · opened Jul 23, 2024 by janelu9
- [REQUEST] How to set Ulysses in the DeepSpeed config JSON?
  enhancement · #5787 · opened Jul 22, 2024 by xs1997zju
- Does DeepSpeed support pure bf16 training?
  enhancement · #5784 · opened Jul 21, 2024 by hjc3613
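On the pure-bf16 question (#5784): the DeepSpeed config does expose a bf16 section that selects bf16 as the mixed-precision dtype in place of fp16. A minimal sketch of such a config, built as a Python dict; the micro-batch size is an arbitrary placeholder, and whether this counts as "pure" bf16 (no fp32 master weights) for a given setup is exactly what the issue asks:

```python
import json

# Minimal sketch of a DeepSpeed config enabling bf16 instead of fp16.
# The micro-batch size is a placeholder; the dict (or a JSON file with the
# same contents) would be passed to deepspeed.initialize(...).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},    # bf16 and fp16 cannot both be enabled
    "fp16": {"enabled": False},
}

print(json.dumps(ds_config, indent=2))
```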
- How to set "training_step" during training?
  bug · training · #5779 · opened Jul 17, 2024 by qwerfdsadad
- [BUG] Universal checkpoint conversion - "Cannot find layer_01* files in there"
  bug · training · #5776 · opened Jul 17, 2024 by exnx
- [BUG] Multi-node fine-tuning with Thunderbolt
  bug · training · #5766 · opened Jul 11, 2024 by Raywang0211
- [BUG] Multi-GPU training gets stuck when the computation graph is not complete for each process
  bug · training · #5762 · opened Jul 10, 2024 by gary-young
- [BUG] I can't run fp8 with pipeline parallel
  bug · training · #5760 · opened Jul 10, 2024 by exnx
- [BUG] Learning rate scheduler and optimizer logical issue
  bug · training · #5731 · opened Jul 5, 2024 by zhourunlong
- [BUG] lr scheduler defined in config cannot be overwritten by lr scheduler defined in code and passed to deepspeed.initialize
  bug · training · #5726 · opened Jul 5, 2024 by xiyang-aads-lilly
- [BUG] ImportError: /home/nlp/.cache/torch_extensions/py310_cu121/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
  bug · training · #5723 · opened Jul 4, 2024 by PhysicianHOYA
- [REQUEST] Asynchronous Checkpointing
  enhancement · #5721 · opened Jul 2, 2024 by zaptrem
- Issue with LoRA tuning on llama3-70b using PEFT and TRL's SFTTrainer
  training · #5719 · opened Jul 2, 2024 by yutanozaki1
- [BUG] Different seeds give the exact same loss on ZeRO 1, 2, and 3 during multi-GPU training
  bug · training · #5717 · opened Jul 2, 2024 by selenerkan
- [REQUEST] Does Universal Checkpoint support MoE checkpoints?
  enhancement · #5716 · opened Jul 2, 2024 by tiggerwu
- [BUG] "localhost: Permission denied, please try again." with a single node and multiple GPUs under --autotuning run
  bug · training · #5709 · opened Jul 1, 2024 by Looong01
- [BUG] 1-bit LAMB not compatible with bf16
  bug · training · #5708 · opened Jun 28, 2024 by catid
- on Activation Checkpointing
  bug · training · #5704 · opened Jun 28, 2024 by ChaunceyWang
- [BUG] Mixed-precision: fp16 will cast input_ids into torch.cuda.HalfTensor instead of Long or Int
  #5701 · opened Jun 28, 2024 by zhaoyang02