commit 47b44a42a641832b3e48e23c2bf608fdb27ac906 Author: alejandrinaqrx Date: Sun Feb 9 15:06:55 2025 +0000 Add Understanding DeepSeek R1 diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..155b894 --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match (or even surpass) OpenAI's o1 model on many benchmarks, it also ships with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 especially interesting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. +The model is also remarkably cost-efficient, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 uses two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL. +2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a <think> tag, before answering with a final summary.
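As a rough illustration, here is a minimal sketch (assuming the <think>...</think> convention described above) of how you might split an R1-style completion into its reasoning trace and final answer; the tag names and example text are just for demonstration:

```python
import re

def split_reasoning(completion: str) -> tuple[str, str]:
    """Split an R1-style completion into (reasoning, final answer).

    Assumes the model wraps its chain-of-thought in <think>...</think>
    and puts the final summary after the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return "", completion.strip()  # no reasoning block found
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer

example = "<think>2 + 2 is 4, because ...</think>The answer is 4."
print(split_reasoning(example))  # ('2 + 2 is 4, because ...', 'The answer is 4.')
```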
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. +R1-Zero attains strong accuracy but often produces hard-to-read outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.
+
It is interesting how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training approach: Pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF +R1-Zero: Pretrained → RL +R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to make sure the RL process has a decent starting point. This gives a good model to start RL from. +First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing. +Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples. +Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities. +Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1. +They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. +The teacher is typically a larger model than the student.
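To make the idea concrete, here is a minimal sketch of the data-generation half of distillation: sample completions (including reasoning traces) from a teacher and store them as SFT data for the student. The model name is a small stand-in so the sketch is cheap to run, and the prompts, file path, and sampling settings are illustrative assumptions, not DeepSeek's actual recipe:

```python
import json
from transformers import pipeline

# Stand-in teacher for illustration; in practice the teacher would be the
# large reasoning model (e.g. DeepSeek-R1) whose traces you want to distill.
teacher = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

prompts = [
    "What is 12 * 17? Think step by step.",
    "Prove that the sum of two even numbers is even.",
]

with open("distill_sft.jsonl", "w") as f:
    for prompt in prompts:
        out = teacher(prompt, max_new_tokens=512, do_sample=True, temperature=0.7)
        completion = out[0]["generated_text"][len(prompt):]
        # Each line becomes one supervised fine-tuning example for the student.
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```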
+
Group Relative Policy Optimization (GRPO)
+
The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. +They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. +Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their method particularly interesting is its reliance on straightforward, rule-based reward functions. +Instead of depending on expensive external models or human-graded examples as in standard RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected <think> / <answer> formatting, and if the language of the answer matches that of the prompt. +Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
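Here is a minimal sketch of what such rule-based rewards could look like. The specific checks and weights are my own illustrative guesses, not DeepSeek's actual reward functions:

```python
import re

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: correctness + formatting + language consistency."""
    reward = 0.0

    # 1. Correctness: compare the text after the reasoning block to a known answer.
    answer = completion.split("</think>")[-1].strip()
    if reference_answer in answer:
        reward += 1.0

    # 2. Format: reasoning must be wrapped in <think>...</think> tags.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5

    # 3. Language consistency: crude stand-in check that an English prompt
    #    gets an answer without CJK characters.
    if prompt.isascii() and not re.search(r"[\u4e00-\u9fff]", answer):
        reward += 0.25

    return reward
```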
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates a group of different responses. +2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency. +3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others. +4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to make sure the policy doesn't drift too far from its original behavior.
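The "relative" part boils down to normalizing each reward against its group. A minimal sketch of that computation (step 3 above), assuming one group of sampled responses per prompt:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: each reward standardized within its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1e-6  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled responses to the same prompt, scored by rule-based rewards.
print(group_relative_advantages([1.75, 0.5, 1.0, 0.0]))
```

These group-relative advantages then take the place of a learned critic's value estimates in the PPO-style clipped update, which is why no separate value model is needed.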
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the <think> syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. +Finally, Yannic Kilcher has an excellent video explaining GRPO by going through the DeepSeekMath paper.
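For reference, a minimal sketch of what GRPO training with TRL could look like; the API details below (GRPOConfig/GRPOTrainer, the reward-function signature, and the example dataset) reflect my reading of the TRL docs at the time of writing, so double-check them against the current library:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt dataset works

def reward_len(completions, **kwargs):
    """Toy reward: prefer completions roughly 20 characters long."""
    return [-abs(20 - len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # small model so the sketch is cheap to run
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-sketch", per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()
```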
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the techniques they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. +Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. +The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
671B via llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 layers seemed to be the sweet spot given this setup.
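If you want to reproduce a setup like this programmatically, a minimal sketch with the llama-cpp-python bindings could look like the following. The GGUF path is a placeholder for wherever you've downloaded the Unsloth quant, the KV-cache quantization flag is omitted here, and the prompt is just an example:

```python
from llama_cpp import Llama

# Placeholder path: first shard of Unsloth's 1.58-bit (UD-IQ1_S) GGUF.
llm = Llama(
    model_path="./DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_gpu_layers=29,  # partial offload: the sweet spot mentioned above
    n_ctx=2048,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```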
+
Performance:
+
A r/localllama user described that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup. +Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than other models, but their usefulness is also usually higher. +We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
+
GPU usage shoots up here, as expected, compared to the mostly CPU-powered run of the 671B model that I showcased above.
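For completeness, the equivalent using the ollama Python client, assuming the 70B distill is available under the deepseek-r1:70b tag (which is what the Ollama library used at the time of writing):

```python
import ollama

response = ollama.chat(
    model="deepseek-r1:70b",  # pulls the default 4-bit (Q4_K_M) quant
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)
print(response["message"]["content"])
```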
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning +[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models +DeepSeek R1 - Notion (Building a fully local "deep scientist" with DeepSeek-R1 - YouTube). +DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs. +The Illustrated DeepSeek-R1 - by Jay Alammar. +Explainer: What's R1 & Everything Else? - Tim Kellogg. +DeepSeek R1 Explained to your grandmother - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com. +GitHub - deepseek-ai/DeepSeek-R1. +deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images. +DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques. +DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage. +DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective. +DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling. +DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
+DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25). +- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25). +- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file