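In-Network Computing is a frontier topic in high-performance computing (HPC) and artificial intelligence (AI). It is a communication acceleration technology developed for InfiniBand networks, targeting the new generation of distributed parallel computing architectures and following an application co-design philosophy. In-network computing effectively addresses the collective-communication and point-to-point bottlenecks in AI and HPC applications, and offers a new approach to data center scalability.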
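It uses network devices such as network adapters (NICs) and switches to compute on data in-line while the data is in transit, with the goals of reducing communication latency and improving overall computing efficiency. This makes the network a computing unit as important as the GPU and the CPU.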
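Over the past few years, modern data centers, as represented by cloud computing, big data, HPC, and AI, have moved entirely to distributed parallel processing architectures. All data center resources, such as CPUs, memory, and storage, are spread across the whole data center and connected by high-speed networking, so that they divide the work and cooperate under a co-design approach to complete the data center's data processing tasks. In a modern data center everything is oriented around business data, and the system architecture is built to be balanced. Along the direction in which business data flows, CPU computing, GPU computing, storage computing, and in-network computing work in concert, and together they form a new generation of data-centric data center architecture.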
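As the key component connecting all resources in a modern data center, high-speed networking, represented by InfiniBand, has advanced rapidly. After the IBTA (InfiniBand Trade Association) was founded in 1999 and published the InfiniBand specification, InfiniBand developed quickly and became widely used in HPC. Today, on the road to exascale computing, in-network computing is becoming a core technology that affects the overall balance of a system. It is becoming part of the distributed computing unit: it works with CPUs and GPUs to build a distributed compute pool, and with storage to build a distributed memory pool.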
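As an efficient, scalable, intelligent interconnect, InfiniBand integrates in-network computing deeply and seamlessly into its products and uses it to solve communication bottlenecks. By offloading communication-related computation from the CPU and GPU into the network, it frees CPU and GPU resources so that applications get more compute resources. This raises overall application performance and helps enterprises meet their data challenges.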
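Key Technologies
1. Network Protocol Offload
Improving the performance of sending and receiving data has always been the foundation of in-network computing. The TCP/IP stack commonly used in the industry is processed in software provided by the operating system kernel, which takes up a large share of CPU resources. In data centers where network bandwidth keeps growing and applications demand ever higher performance, this software protocol processing is increasingly the bottleneck for overall performance.
InfiniBand adapters and switches implement the physical, link, network, and transport layers of network communication entirely in ASIC hardware. As a result, the data flow needs no additional software or CPU processing during communication, which greatly improves communication performance.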
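2. Remote Direct Memory Access (RDMA)
RDMA (Remote Direct Memory Access) is a kernel-bypass technology. It provides verbs, an abstraction layer over the NIC hardware, which allows user-space processes to bypass the kernel and access the RDMA NIC directly for fast, efficient communication. RDMA originated in InfiniBand, and the IBTA defines and maintains its standard. At the hardware level, the RDMA hardware uses DMA to read and write user-space memory directly on both the sending and the receiving side. The protocol offload engine built into the NIC performs the complete protocol processing for the transfer, so high-throughput network communication is achieved while consuming almost no CPU.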
![alt=](http://image.danews.cc/upload/doc/20200908/5f574821b04e2.png)
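The direct-placement behavior described above can be illustrated with a toy model. The sketch below is plain Python and does not use the verbs API; `MemoryRegion`, `ToyNic`, `register`, `incoming_write`, and `rdma_write` are hypothetical names chosen for illustration. It only shows the idea that a buffer is registered once, a key is handed to the peer, and the target NIC then places incoming data straight into that user buffer without a per-message step on the target's CPU or kernel.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryRegion:
    """A user-space buffer that has been registered with the toy NIC."""
    buf: bytearray
    rkey: int  # key a remote peer must present to access this region


@dataclass
class ToyNic:
    """Hypothetical stand-in for an RDMA NIC; not a real verbs binding."""
    regions: dict = field(default_factory=dict)

    def register(self, buf: bytearray) -> MemoryRegion:
        # One-time setup: after this, the data path needs no kernel call.
        rkey = len(self.regions) + 1
        region = MemoryRegion(buf, rkey)
        self.regions[rkey] = region
        return region

    def incoming_write(self, rkey: int, offset: int, payload: bytes) -> None:
        # The NIC validates the key and bounds, then writes directly
        # into the registered user buffer (modeling the DMA step).
        region = self.regions.get(rkey)
        if region is None:
            raise PermissionError("unknown rkey")
        if offset < 0 or offset + len(payload) > len(region.buf):
            raise ValueError("write outside the registered region")
        region.buf[offset:offset + len(payload)] = payload


def rdma_write(target_nic: ToyNic, rkey: int, offset: int, payload: bytes) -> None:
    """Initiator side: post a one-sided write toward the target NIC."""
    target_nic.incoming_write(rkey, offset, payload)


# The target application registers a buffer and shares its rkey out of band.
target = ToyNic()
recv_buf = bytearray(16)
region = target.register(recv_buf)

# The initiator writes into the target's buffer; no target-side receive call runs.
rdma_write(target, region.rkey, 4, b"data")
print(bytes(recv_buf))
```

A real deployment would go through a verbs library and a connected queue pair; the model leaves those out to keep the registration and key check visible.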
3. GPUDirect RDMA
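GPUs are now widely used as the compute core of HPC and AI platforms, and the performance of communication between GPUs strongly affects the overall efficiency of a GPU cluster. GPUDirect applies InfiniBand's RDMA capability to communication between GPU nodes, which speeds up GPU communication in a GPU cluster during HPC computation or AI training. It lets the RDMA NIC read and write GPU memory directly. When processes on GPUs in two nodes of the cluster need to communicate, the RDMA NICs can move data by RDMA directly between the GPU memories on the two sides. This removes the CPU from the data copy, eliminates unnecessary copies, reduces the number of PCIe bus accesses, and greatly improves communication performance.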
![alt=](http://image.danews.cc/upload/doc/20200908/5f574840f1ee4.png)
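4. Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is a network offload technology for collective communication.
HPC and AI computations often involve many collective communications. Because these operations are global, they often have a very large effect on an application's parallel efficiency. Many studies in the industry use a variety of software methods to optimize collective communication, but completing one collective operation still takes multiple rounds of communication across the network, which easily introduces network congestion. Even after such optimization, the latency of collective communication remains more than an order of magnitude higher than that of point-to-point communication.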
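To make the "multiple rounds" point concrete, the sketch below counts communication steps for two textbook software all-reduce algorithms. The article does not name specific algorithms, so ring all-reduce and recursive doubling are used here only as familiar examples, and the function names are illustrative. A single point-to-point message counts as one step by comparison.

```python
def ring_allreduce_steps(n: int) -> int:
    """Ring all-reduce: (n - 1) reduce-scatter steps plus (n - 1) all-gather steps."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 2 * (n - 1)


def recursive_doubling_steps(n: int) -> int:
    """Recursive doubling all-reduce: log2(n) exchange steps, n a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1


for n in (8, 64, 1024):
    print(n, ring_allreduce_steps(n), recursive_doubling_steps(n))
```

For 1024 processes this gives 2046 steps for the ring and 10 for recursive doubling, so even the logarithmic algorithm pays several network round trips per collective.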
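To address this, NVIDIA Mellanox introduced SHARP starting with its EDR InfiniBand switches. A compute engine is integrated into the switch chip. It supports 16-bit, 32-bit, and 64-bit fixed-point or floating-point computation; operations such as sum, minimum, maximum, AND, OR, and XOR; and collectives such as Barrier, Reduce, and All-Reduce.
For clusters built from multiple switches, Mellanox defined a complete offload mechanism, the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) [3]. An Aggregation Manager constructs a logical SHARP tree on top of the physical topology, and the switches in the SHARP tree process collective operations in a parallel, distributed fashion. When hosts need to perform a global operation such as allreduce, every host submits its data to the switch it is connected to. A first-level switch that receives the data uses its built-in engine to compute on it, then submits the result to its parent switch in the SHARP tree. The parent switch uses its own engine to aggregate the results collected from several switches and passes its result further up the tree. When the data reaches the root switch of the SHARP tree, the root performs the final computation and sends the result back to all host nodes. SHARP can greatly reduce the latency of collective communication, reduce network congestion, and improve the scalability of the cluster.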
![alt=](http://image.danews.cc/upload/doc/20200908/5f57485714cf8.png)
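The upward aggregation and downward result distribution over a SHARP tree can be sketched in a few lines. This is a simplified host-side simulation, not the SHARP protocol or any NVIDIA API: the tree is modeled as nested Python lists (a list is a switch, a number is a host's contribution), and `OPS`, `aggregate`, `count_hosts`, and `tree_allreduce` are hypothetical names.

```python
import operator
from functools import reduce

# Reduction operators of the kind the text lists for the switch engine.
OPS = {
    "sum": operator.add,
    "min": min,
    "max": max,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
}


def aggregate(node, op):
    """Upward pass: a host returns its value, a switch combines its children."""
    if not isinstance(node, list):
        return node
    partials = [aggregate(child, op) for child in node]
    return reduce(op, partials)


def count_hosts(node):
    """Number of hosts (leaves) below a node."""
    if not isinstance(node, list):
        return 1
    return sum(count_hosts(child) for child in node)


def tree_allreduce(tree, op_name):
    """All-reduce over the tree: reduce up to the root, then send the result to every host."""
    result = aggregate(tree, OPS[op_name])
    return [result] * count_hosts(tree)


# One root switch with two leaf switches, each serving four hosts.
tree = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(tree_allreduce(tree, "sum"))
print(tree_allreduce(tree, "max"))
```

In this model each host sends its data once and receives the result once, and the number of aggregation stages equals the depth of the tree rather than growing with the number of hosts.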
Product Versions
- SHARPv1: implemented on Switch-IB2 EDR InfiniBand; offloads collective operations of up to 256 bytes
- SHARPv2: implemented on Quantum HDR InfiniBand; offloads collective operations of up to 2 GB
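Applications
High-Performance Computing (HPC)
Most HPC applications are compute-intensive. They consume large amounts of CPU and GPU resources and usually involve many kinds of point-to-point and collective communication, so they are very sensitive to communication bandwidth and latency. They need communication protocol offload to reduce contention for CPU and GPU resources, and they make wide use of RDMA, GPUDirect, and SHARP to improve overall computing performance.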
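Artificial Intelligence (AI)
AI is one of today's most active technology areas, and completing training quickly and efficiently to obtain a highly accurate model is one of the key capabilities of an AI platform. The industry widely uses GPUs or dedicated AI chips as the compute core of training platforms to accelerate training. AI training is also a typical compute-intensive workload and needs communication protocol offload to reduce latency. GPUDirect RDMA, in turn, effectively raises communication bandwidth and lowers communication latency between nodes in a GPU cluster.
In large-scale distributed training, the currently popular data-parallel approach to deep neural networks trains multiple replicas of the network in parallel and synchronizes the model across all replicas after each replica finishes an iteration. Model synchronization is often implemented with a collective operation such as all-reduce, and its performance is a key factor in the performance of distributed machine learning. SHARP can noticeably improve all-reduce performance in AI training, speed up model synchronization, and greatly improve the overall training performance of the cluster.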
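The synchronization step in data-parallel training can be shown with a minimal sketch. It assumes a one-parameter linear model and simulates the workers inside a single process; `local_gradient`, `allreduce_sum`, and `train` are illustrative names, and `allreduce_sum` stands in for whatever collective the cluster provides (offloaded to the network or not).

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for y ~ w * x on one worker's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_sum(values):
    """Stand-in for the collective: every worker receives the global sum."""
    total = sum(values)
    return [total] * len(values)


def train(shards, steps=50, lr=0.05):
    """Data-parallel SGD: one model replica per shard, synchronized every step."""
    weights = [0.0] * len(shards)
    for _ in range(steps):
        # Each worker computes a gradient on its own shard.
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        # Model synchronization: all workers obtain the same summed gradient.
        summed = allreduce_sum(grads)
        # Every replica applies the same averaged update, so replicas stay identical.
        weights = [w - lr * g / len(shards) for w, g in zip(weights, summed)]
    return weights


# Two workers, each holding a shard of samples drawn from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weights = train(shards)
print(weights)
```

Because the all-reduce sits inside every training step, its latency is paid once per iteration, which is why the collective's performance weighs so heavily on total training time.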
September 8, 2020, Shanghai
Copyrighted work. Do not reproduce without permission.