discussion 2024-07-18
Summary
In the Discord chat, Samuel sought assistance with an issue he encountered while training a universal transformer on a synthetic dataset using the AdamW optimizer without any scheduler. He observed that the model trained well with layer normalization but failed to learn when RMSNorm was used instead. Dredd joined the server during this discussion, but no immediate solutions or insights were offered by other participants in response to Samuel's problem.
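For context, here is a minimal sketch of the kind of setup described above: a post-layer-norm universal transformer (one shared block unrolled for a fixed number of steps) trained with AdamW and no scheduler, with the normalization class passed in so LayerNorm can be swapped for RMSNorm. This is not Samuel's actual code; module names, dimensions, and hyperparameters are illustrative assumptions.

```python
# Sketch of the reported setup (not the actual code from the chat): a post-LN
# universal transformer block, reused across depth, with a swappable norm class.
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int, norm_cls=nn.LayerNorm):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = norm_cls(dim)  # post-LN: normalize after the residual add
        self.norm2 = norm_cls(dim)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.ff(x))
        return x


class UniversalTransformer(nn.Module):
    """Universal transformer: the same block applied for a fixed number of steps."""

    def __init__(self, dim=128, n_heads=4, steps=6, norm_cls=nn.LayerNorm):
        super().__init__()
        self.block = PostLNBlock(dim, n_heads, norm_cls)
        self.steps = steps

    def forward(self, x):
        for _ in range(self.steps):
            x = self.block(x)
        return x


model = UniversalTransformer(norm_cls=nn.LayerNorm)  # swap in an RMSNorm class here
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # no LR scheduler, as in the chat
```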
FAQ
- What happens when using RMSNorm instead of LayerNorm when training a universal transformer with post-layer normalization on a synthetic dataset?
  - [SSamuel]: With RMSNorm the model doesn't learn effectively, whereas it trains well with LayerNorm. This was observed while using the AdamW optimizer without any scheduler.
- Are there possible solutions or insights for when a universal transformer fails to learn with RMSNorm?
  - [SSamuel]: The question remains open; no specific solution was offered in the chat. Investigating the differences between LayerNorm and RMSNorm (see the sketch after this list) was suggested as a starting point for understanding the issue.
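As a starting point for that investigation, the sketch below contrasts the two normalizations: LayerNorm subtracts the mean and applies a learned scale and bias, while RMSNorm only rescales by the root-mean-square with a learned scale, which changes the statistics the residual stream sees under post-layer-norm. The RMSNorm class here is a hand-rolled illustration, not code from the chat.

```python
# Minimal comparison of LayerNorm vs. RMSNorm (illustrative, not from the chat).
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square normalization: scale only, no mean centering, no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


x = torch.randn(2, 5, 64)
ln_out = nn.LayerNorm(64)(x)   # (x - mean) / std, then learned scale and bias
rms_out = RMSNorm(64)(x)       # x / rms(x), then learned scale only
```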
Who Helped Who
- Dredd joined the server during Samuel's discussion of the RMSNorm issue, indicating a potential willingness to assist, but no concrete insights or solutions were offered.
Action Items​
Technical Tasks:
- Investigate why the universal transformer fails to learn when RMSNorm replaces LayerNorm (mentioned by SSamuel)
Documentation Needs: No documentation needs were explicitly requested in this chat transcript.
Feature Requests: No feature requests were suggested or discussed in this chat transcript.
Community Tasks:
- Provide insights, experiences, and possible solutions for the issue with RMSNorm (requested by SSamuel)