junjie1475's profile

Playing with Performance Counters on m3 http://tieba.baidu.com/mo/q/checkurl?url=https%3A%2F%2Fgist.github.com%2Fjunjie1475%2F738ba50f9e933e6beeadf89a40fe1390&urlrefer=8b59702c4ef4ec0ae0a23b5ab1ad1488

AAPL Scheduler design: Smarter = Lower Power & Higher Frequency
Fig 0: Apple's overall scheduling design
Background
Traditionally, the scheduler (reservation station) has been designed as a queue that preserves age information, so that the oldest instruction can be selected when several instructions are ready to issue. However, a queue structure consumes significant power and involves complex wiring. Each time instructions issue, the queue has to collapse to fill the gaps left by the issued entries (see Fig. 1).
Fig 1: Many writes have to be generated when "collapsing".
Apple's approach uses an age matrix to track relative age, as described in the 2015 patent (http://tieba.baidu.com/mo/q/checkurl?url=https%3A%2F%2Fpatents.google.com%2Fpatent%2FUS20170024205A1&urlrefer=ad71d8863c37caca5df0bb188cfd95ad) for a non-shifting reservation station. The idea is that each entry in the scheduler carries additional state that indicates its age relative to the other entries (1 for older, 0 for younger). For brevity, I will use C syntax, where M[X][Y] denotes the Yth bit of entry X in the matrix. For example, if instruction A enters scheduler entry 0 and there are no other instructions yet, its state is M[0]=11110 (assuming a pool of 5 entries), since it is older than anything that will later occupy the unallocated entries (see Fig. 2).
Fig 2: Each entry has an N-bit field to keep track of relative age, where N is the size of the scheduler.
This structure is easy to update. Consider what happens to the relative age between entries when an instruction enters and exits the scheduler. When an instruction first enters the scheduler, you allocate an entry in the array and set its age vector accordingly. Since the dispatch stage is in order, all instructions already inside the scheduler are older than the current instruction, so those bits are set to 0.
For unallocated entries, we set the bits to 1, because whatever is allocated there later must be younger than the current instruction. When an instruction is issued from the scheduler, its entry is freed, and in every allocated entry the bit that records age relative to the freed entry is set to 1. This is because the next instruction allocated into that entry must be younger than all instructions currently in the scheduler.
Now that we have this extra state, how can we use it? We can refer to a 2016 patent (http://tieba.baidu.com/mo/q/checkurl?url=https%3A%2F%2Fpatents.google.com%2Fpatent%2FUS10120690B1&urlrefer=0b784ec7db4f8c01cf41b148f59c0e63) on reservation station early age indicator generation. The scheme Apple describes not only consumes less power and is easier to wire, but also results in a shorter critical path and higher frequency. The goal is to find the oldest ready instruction to issue into the execution unit using just the matrix state. This can be pictured as a "knockout round" based on the age of the instructions (see Fig. 3).
Fig 3: A knockout round between instructions
Using the matrix, comparing the age of any two scheduler entries is straightforward: First, ensure that the entry is valid (i.e., allocated). Next, check that the instruction is ready for execution. Finally, compare the age fields in the matrix: if M[X][Y] > 0, then X is older than Y. Apple implements this process with a few tweaks to shorten the critical path. A modern CPU has a budget of roughly 20-24 gate delays per cycle. The tree height is log2(Entries); for a scheduler with 32 entries, only 5 passes of comparison are needed.
Fig 4: Overview of Apple's approach
Let's make a few assumptions: One instruction is dispatched to the scheduler per cycle (which holds true for Apple due to their dispatch buffer design). One instruction is dequeued from the scheduler to the execution unit per cycle.
Apple separates the comparison into two pipeline stages. The first pass selects the oldest instruction from each adjacent pair of entries and stores the selected entries in registers. In the following cycle, the remaining passes are run on the instructions selected in the previous cycle. This creates a potential issue: the selection is based on information from the previous cycle rather than the current one.
To illustrate the problem: In cycle 0, one instruction is selected to execute (it will dequeue in the next cycle), but the first pass does not use this information, as the scheduler only updates on the transition to the next cycle. In cycle 1, the first-pass result continues down the pipeline and selects the same instruction that was chosen before. This happens because it is based on outdated information; that instruction was selected because it was the oldest in the earlier cycle. Meanwhile, the first pass stage now selects instructions based on the after-dequeue information. (Note: This is my personal interpretation, not explicitly stated in the 2016 patent.)
To avoid this repetition, we calibrate the previously selected entries with the current cycle's information. In the second stage of the pipeline, we know that an instruction has dequeued from the scheduler, so we can simply remove that instruction from the selected list, keeping the information up to date. When a new instruction enters the scheduler, it must be the youngest compared to all instructions already present. Thus, the only situation where it will be selected is if no instructions were chosen by the previous pass. It's an easy yet elegant solution, isn't it?
Taking a step back, this is essentially incremental adjustment! Since at most two instructions are inserted or deleted from the array in any given cycle, it is feasible to select from the previous sample with minimal adjustments, avoiding delays.
This is far superior to redoing the entire selection process, which could lead to longer dependency paths. Thanks to Maynard Handley for pointing out the patents in his M1 exploration PDF. If you haven’t read them, you should definitely check them out.
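The update and comparison rules of the age matrix described above can be sketched in a few lines of Python. This is only an illustrative model of the rules as the post states them, not Apple's circuit: the class name `AgeMatrix`, the `ready` flags, and the single-cycle `select` (which ignores the two-stage pipelining) are assumptions made for the example.

```python
class AgeMatrix:
    """Toy non-shifting scheduler: m[x][y] == 1 means entry x is older than entry y."""

    def __init__(self, size):
        self.size = size
        self.valid = [False] * size   # entry is allocated
        self.ready = [False] * size   # operands available (set by the caller)
        self.m = [[0] * size for _ in range(size)]

    def allocate(self, x):
        # In-order dispatch: every allocated entry is older than x (bit 0);
        # every unallocated entry will be younger once it is filled (bit 1).
        for y in range(self.size):
            self.m[x][y] = 0 if (y == x or self.valid[y]) else 1
        self.valid[x] = True
        self.ready[x] = False

    def free(self, x):
        # Whatever lands in x next is younger than all entries present now.
        self.valid[x] = False
        self.ready[x] = False
        for y in range(self.size):
            if self.valid[y]:
                self.m[y][x] = 1

    def older(self, x, y):
        # One knockout match; None means "no candidate on this side".
        cx = x is not None and self.valid[x] and self.ready[x]
        cy = y is not None and self.valid[y] and self.ready[y]
        if cx and cy:
            return x if self.m[x][y] else y
        return x if cx else (y if cy else None)

    def select(self):
        # About log2(size) rounds of pairwise comparisons.
        winners = list(range(self.size))
        while len(winners) > 1:
            if len(winners) % 2:
                winners.append(None)
            winners = [self.older(winners[i], winners[i + 1])
                       for i in range(0, len(winners), 2)]
        return winners[0]
```

With a pool of 5, allocating entry 0 first gives the row `[0, 1, 1, 1, 1]`, which is the post's M[0]=11110 read from bit 4 down to bit 0.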

If you want to learn about SME, read this document: https://github.com/apple-oss-distributions/xnu/blob/main/doc/arm/sme.md

Qualcomm Begins Optimizing Glibc For Their Oryon CPU Core http://tieba.baidu.com/mo/q/checkurl?url=https%3A%2F%2Fwww.phoronix.com%2Fnews%2FQualcomm-Olyon-1-Glibc&urlrefer=30f43730b946978dc78f3f47f7a58244 Somebody must be overjoyed: all the earlier tests are now void.

Why the uop cache, variable-length instruction sets, and SMT are all garbage http://tieba.baidu.com/mo/q/checkurl?url=https%3A%2F%2Fvighneshiyer.com%2Fmisc%2Ftesla-talk%2FEric_Quinnell_Tesla_Talk.pdf&urlrefer=c6652d0041a3312f230914baea73c410 http://tieba.baidu.com/mo/q/checkurl?url=https%3A%2F%2Fvighneshiyer.com%2Fmisc%2Ftesla-talk%2F&urlrefer=230c11ea5fb1fee7fa657d7e038a436c

Apple power model / counters

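Completely speechless at Apple 😅 When I started testing, the first thing I did was connect to Xcode, only to get an error that the developer disk image could not be installed. Rebooting, reinstalling Xcode, and updating the OS did nothing. It turned out that Apple's servers were not signing the DDI for the M4 iPad, even though the Air from the same batch worked fine… The most outrageous part is that signing only started on the second day of sales, so all testing had to be done via sideloading. APPLE PLEASE MAKE IT EASIER😅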

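The load queue of Apple's M4 CPU
A few days ago, while testing the M4's ld/st queues, I noticed behavior that differs from the A17, so here is a post about it. The test probe is:
  Cache miss Load (First dependent chain)
  Independent Load
  …..
  Cache miss Load (Second dependent chain)
On the A17, the cycle spike in this test appears at around 100 independent loads. On the M4 the number is abnormally high, so I believe the latency there is limited by another structure (the ROB) rather than the load queue. My guess is that Apple releases load queue entries early when there is no older store that has yet to execute. A further test probe:
  Cache miss Load (First dependent chain)
  Store data depend on address of first load
  Independent Load
  …..
  Cache miss Load (Second dependent chain)
Surprisingly, this probe produced the same number as the previous one. Apple's ld/st queue entries are only allocated at execution time, which means that even while the store is still sitting in the scheduler, load queue entries are released as long as the store's memory address is known. The third probe is much like the second, except that the store's memory address depends on the previous load. This time the number fell back into a reasonable range. The reason: if the store does not yet know its address, load bypassing may mispredict. Once the store obtains its address, it has to check the load queue for any load that depends on this store but took its data from the cache; if one is found, that load and all instructions after it must be re-executed. If load queue entries were also released early in this case, the later store could no longer check whether a load needs to be re-executed.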

M4 GB6 single-core 4001, liquid nitrogen

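Apple SIMD/FP register renaming
I have been looking into SME/SVE recently. Arm's SIMD/FP registers overlap differently from x86 (d0 and d1 both map onto q0, as its high and low 64 bits). I ran some tests on Apple's physical registers. First I tested the PRF allocation limit for the Q0 register: on the M3 it is a bit over 500, which means the physical register file tops out at about 500x128 bit. Then I switched the test register to D0, and the result was the same as for Q0. Apple allocates a 128-bit register slot for a destination register of any size. Going further, I mixed instructions with destinations d0 and d1 and found the limit was still the same, so the conclusion is that there is no register sharing within a 128-bit register. My guess is that Apple assigns each PHR a size parameter, and the mapping table can map one 128-bit NEON register to multiple 128-bit physical registers. This is a relatively simple implementation, similar to what is done on the x86 side.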

The M4 is a total letdown, the improvement is roughly zero

On GPGPU core sharing and preemption
First, some terminology:
warp: similar to a SIMD thread on a CPU
threadblock: a collection of warps that is guaranteed to run on the same SM
SM: the minimum unit of an execution core
GPU core: a collection of SMs (assumed to be 4 here)
Current GPU core designs have all converged on 4 SMs per GPU core. The SMs have independent schedulers and execution units, but they all share the same local data cache. At a high level, the top-level scheduler breaks a kernel down into multiple thread blocks, and each thread block is guaranteed to be scheduled onto one GPU core. Once a thread block is issued to a GPU core, it is up to the core scheduler to decide how to distribute its warps across the SMs. Nowadays, warps from different thread blocks of different kernels can all be scheduled onto the same GPU core to maximize utilization. So what the core scheduler does is dispatch as many warps as possible to the SMs (to hide latency). That said, the number of warps that can be scheduled onto an SM depends on the resources that are shared between threads, e.g. register file size, branch convergence stack, etc.
Here is an example: two thread blocks, each with a warp size of 32, are handed to the core scheduler. Every cycle, the core scheduler checks how many threads are executing in the SMs; once there is room (register space, convergence stack, and other state required to run the threads) and enough space in local shared memory, the warps are dispatched. Inside the SM there is one more scheduler that picks a warp to execute from the active set of warps (the active set is a subset of the larger set of running warps); each cycle one warp is dispatched and decoded. The warp instructions then reach another, smaller queue. Each cycle, yet another scheduler checks the instructions in that queue for dependences and tries to find one to issue into the execution pipeline.
In the end, once a warp has finished execution, its registers and branch stack are released. This allows another warp to take the place of the finished one.
Preemption
We now allow warps from different kernels (or even different processes) to execute concurrently in a GPU core, but we still lack time sharing. For example, imagine you have a long-running background computation task but still want to refresh your display every 1/60 sec. You can do this at the thread block level: move the thread blocks of the render kernel to the front of the core scheduler's waitlist, so that once the preceding thread blocks have finished executing, warps are dispatched from the render thread blocks rather than the computation ones. If the running thread blocks take too long to execute, you can preempt at the instruction level. This is costly: it requires saving all the current register context and other state, then swapping in the thread blocks of the render kernel. Once the rendering threads have finished, you restore the register state and resume the previous threads as usual.
Apple gives each GPU core 64 KB of scratchpad memory, and a thread block can use up to 32 KB of it. I think the purpose of this design is that, at any cycle, the core can keep at least two thread blocks active (if other resources are not constrained). So when execution reaches the end of one thread block, the core scheduler can start dispatching threads from the next thread block if there are free slots in the SMs. So basically things happen like this: there is a waitlist inside each GPU core, and the top-level scheduler dispatches thread blocks to each GPU core. Every cycle the core scheduler checks whether it is possible to dispatch thread blocks from the waitlist (i.e. it checks scratchpad usage and free slots in the SMs). If all resources are available, it dispatches the warps into the SMs; otherwise it keeps waiting for free slots.
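The waitlist behaviour described above can be sketched as a small Python model. It is a simplification for illustration only: the names `ThreadBlock` and `GpuCore`, the figure of 8 warp slots per SM, and the "most free slots first" placement rule are assumptions, while the 64 KB scratchpad, the 32 KB per-block limit and the 4 SMs come from the post.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ThreadBlock:
    name: str
    warps: int
    scratch_kb: int  # scratchpad requested, at most 32 KB per the post


class GpuCore:
    def __init__(self, scratch_kb=64, num_sm=4, slots_per_sm=8):
        self.free_scratch = scratch_kb
        self.free_slots = [slots_per_sm] * num_sm  # hypothetical warp slots per SM
        self.waitlist = deque()
        self.resident = []  # (thread block, SM index of each of its warps)

    def try_dispatch(self):
        """One scheduler step: dispatch the head of the waitlist if resources allow."""
        if not self.waitlist:
            return None
        tb = self.waitlist[0]
        if tb.scratch_kb > self.free_scratch or tb.warps > sum(self.free_slots):
            return None  # keep waiting for scratchpad or free slots
        self.waitlist.popleft()
        self.free_scratch -= tb.scratch_kb
        placement = []
        for _ in range(tb.warps):
            sm = max(range(len(self.free_slots)), key=lambda i: self.free_slots[i])
            self.free_slots[sm] -= 1
            placement.append(sm)
        self.resident.append((tb, placement))
        return tb.name

    def retire(self, name):
        """A thread block finished: give back its scratchpad and warp slots."""
        for i, (tb, placement) in enumerate(self.resident):
            if tb.name == name:
                self.free_scratch += tb.scratch_kb
                for sm in placement:
                    self.free_slots[sm] += 1
                del self.resident[i]
                return
        raise KeyError(name)
```

Thread-block-level preemption then amounts to `core.waitlist.appendleft(render_block)`: the render block is dispatched as soon as a resident block retires, ahead of the queued compute blocks.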

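CPU load/store out-of-order execution
Let me talk about out-of-order execution of load/store instructions and how it relates to the load/store queues. Around the mid-to-late 1990s, high-performance CPUs began executing load/store instructions out of order.
 - A load executes as soon as it has obtained its memory address
 - A store executes as soon as it has its memory address and the data to store
This creates a few problems:
 - Executing a store changes memory state. If a branch misprediction or an exception then requires rolling back and re-executing, the memory state changed by the store cannot be undone
 - When a load executes, the stores before it (in program order) have not necessarily finished executing, so how does it get its data?
The first problem can be solved with a store queue. While stores are still in program order (at rename in the front end), each is allocated a store queue entry (containing the memory address, the data to store, a time tag, and some other flags). When the store executes, it does not access memory; instead it writes the data and the memory address into the store queue. When the store reaches the head of the ROB, the data can be written to memory.
The second problem can be solved with a load queue. While loads are still in program order (at rename in the front end), each is allocated a load queue entry (containing the memory address, a time tag, and some other flags). When a load obtains its memory address and executes, there are 2 cases:
1) The stores before the load have already executed (check whether the stores older than the load's time tag have all finished)
 - In this case you only need to compare the addresses in the store queue with the load's address. On a match, take the data directly from the store queue; otherwise fetch it from the L1 cache
2) The stores before the load have not all executed. Here again there are 2 possibilities: a) the store queue holds a matching memory address; b) there is no matching address, but you cannot be sure, because the earlier stores have not finished executing. You then have to predict whether the current load depends on a store in the store queue. If not, read the L1 cache; if so, keep waiting in the RS to be re-executed (for example via store forwarding: when a store executes, it searches the load queue for loads with a matching address and, if it finds one, hands the data to that load).
Since this is a prediction, it can be wrong. So when a store retires, it has to check whether the loads in the load queue obtained the correct data (that is, look for loads with a matching memory address and check whether they took their data from the L1 cache rather than the store queue). If they did not, the load and everything after it is re-executed.
Summary
Load queue
 - store forwarding has to search the load queue to deliver data
 - detecting whether a dependence prediction failed
Store queue
 - implementing precise exceptions
 - load bypassing
Real implementations are far more complex than described above (for example, how to handle loads/stores of different sizes), but going further here would make things very confusing.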
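A minimal Python sketch of the store queue / load queue interaction described above. It assumes integer tags in program order, full-width accesses to identical addresses, a dictionary standing in for the L1 cache, and a load that always speculates "no dependence" when an older store address is unknown; the class and method names are made up for the example.

```python
class LoadStoreQueues:
    def __init__(self, memory):
        self.memory = memory    # dict address -> value, stands in for the L1 cache
        self.store_queue = []   # allocated in program order at rename
        self.load_queue = []    # executed loads, kept for the later check

    def alloc_store(self, tag):
        self.store_queue.append({"tag": tag, "addr": None, "data": None})

    def execute_store(self, tag, addr, data):
        # Executing a store only fills its queue entry; memory is untouched.
        entry = next(s for s in self.store_queue if s["tag"] == tag)
        entry["addr"], entry["data"] = addr, data

    def execute_load(self, tag, addr):
        source, value = None, self.memory.get(addr, 0)
        for s in self.store_queue:              # oldest to youngest
            if s["tag"] < tag and s["addr"] == addr:
                source, value = s["tag"], s["data"]  # youngest older match wins
        self.load_queue.append({"tag": tag, "addr": addr, "source": source})
        return value

    def retire_store(self, tag):
        """Write memory at retirement and return the tags of loads to replay."""
        entry = self.store_queue.pop(0)
        assert entry["tag"] == tag, "stores retire in program order"
        self.memory[entry["addr"]] = entry["data"]
        return [l["tag"] for l in self.load_queue
                if l["tag"] > tag and l["addr"] == entry["addr"]
                and (l["source"] is None or l["source"] < tag)]
```

A load that ran before an older store to the same address had resolved shows up in the list returned by `retire_store`, which is the "check the load queue at store retire" step; a load that was forwarded from that same store does not.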

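The rendering pipeline of today's IMR GPUs
First, why GPUs, and the graphics hardware in particular, are hard for many people to understand. A CPU is conceptually quite simple: it fetches instructions from one place and executes them. A GPU is different. Take Apple: the Apple GPU has an Arm coprocessor (CP). The CP reads data and commands that the CPU has prepared in memory (not necessarily at assembly level; they can be fairly abstract and high-level). After reading the commands the CPU gave it, the CP generates threadblocks. This step is actually done by dedicated dispatcher hardware that sits next to the rasterization hardware. The discussion here is about the rasterization pipeline; in theory mesh shading works the same way. Continued in the replies below.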

A17 Pro CPU P-core & E-core codenames. Device tree dump: E Core sawtooth, P Core Everest. Same as the A16.

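On Apple's speculative scheduling and instruction replay
In the late 1990s, high-performance out-of-order processors such as the Alpha 21264 and the Intel Pentium 4 adopted speculative scheduling. Below I explain what speculative scheduling is. As we know from Tomasulo's algorithm [1], after renaming, each instruction waits in the reservation station (RS) for its source operands. Once the source operands it depends on are ready, the instruction is woken up, and the scheduler selects an instruction, based on the information it has, to send to an execution unit. After the instruction executes, it broadcasts its tag. RS entries that match the tag are then woken up, selected, read their source operands, and are sent to the execution units, and so on. In modern processors, however, to reach higher frequencies, the path from wakeup to actually entering the execution unit is often split into several pipeline stages (see Fig 1). If we waited until an instruction finished executing before broadcasting its tag, and only then selected and read source operands, there would be a delay between two dependent instructions: after the first instruction finishes, several cycles of wakeup, select, and register read pass before the second instruction actually enters the execution unit. This causes pipeline bubbles and lowers the processor's IPC.
Fig 1 [2]: The path from wakeup to the execution unit is split into multiple stages
The solution is to move the tag broadcast from the end of execution up to the stage where the instruction is selected. At that stage, the instructions that depend on it are told that the source operand they need will be produced within the next few cycles. Dependent instructions can then issue early, and when they reach the register read/bypass stage they read the value just as the producing instruction finishes (Fig 2). For some short-latency instructions, the tag of the instruction we depend on is not enough; we also need to watch the tags of the instructions that it depends on [2]. Otherwise, even with the tag broadcast moved earlier, by the time the scheduler compares tags and selects, the producing instruction has already finished, and we still have to wait one more cycle before entering the execution unit (Fig 3).
Fig 2 [3]: Two dependent instructions execute "back to back"
Fig 3 [2]: A LOAD depends on a low-latency ADD and starts executing one cycle late
To make sure an instruction reads its source operand at exactly the right moment, we need to record the latency with which each operand is produced. Say the producer's operand appears 5 cycles after it enters the execution unit, and the pipeline takes 2 cycles from wakeup to the execution unit. Then we only need to delay the wakeup of the current instruction by (5-2=3) cycles, and it will read the producer's result off the bypass bus just as it is sent into the execution unit (Fig 4).
Fig 4 [3]: The second instruction depends on the first, a LOAD, whose latency (assuming an L1 hit) is 3 cycles
However, this brings a problem: for some instructions the latency is not fixed. Whether a LOAD hits the L1 cache or the TLB, or finds at execution time that the data of a store it depends on is not there yet, all of these delay the LOAD. If we take no countermeasures, the instructions that depend on the LOAD will read wrong data off the bypass bus. One option is to flush and re-execute everything after the LOAD whenever the LOAD takes longer than expected. As you can imagine, that is very expensive. So today's high-performance processors all opt for instruction replay. We record all instructions that depend on the LOAD, and if the LOAD's latency does not match the expectation, all of them are re-executed. Re-executing here means issuing again, not fetching and decoding again, and it only applies to instructions that depend on the LOAD. For Apple's implementation you can refer to their patent [4]. Apple's current replay implementation is fairly simple: all instructions that depend on the load stay in the issue queue until verification, and only then are they released or replayed. If you want a more thorough analysis, I recommend the paper by I. Kim and M. H. Lipasti [5], which lists several implementations. There is non-selective replay, where every instruction that was woken up/selected and issued between the load's issue and its verification has to be re-executed; the early Alpha 21264 used a similar design. And there is selective replay, which only replays instructions that depend directly or indirectly on the load, and which itself comes in several variants.
References
[1] Tomasulo, Robert M. "An efficient algorithm for exploiting multiple arithmetic units." IBM Journal of research and Development 11.1 (1967): 25-33.
[2] Stark, Jared, Mary D. Brown, and Yale N. Patt. "On pipelining dynamic instruction scheduling logic." Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture. 2000.
[3] Perais, Arthur, et al. "Cost-effective speculative scheduling in high performance processors." ACM SIGARCH Computer Architecture News 43.3S (2015): 247-259.
[4] Apple Inc. US10514925B1 - Load speculation recovery (2016)
[5] I. Kim and M. H. Lipasti, "Understanding scheduling replay schemes," 10th International Symposium on High Performance Computer Architecture (HPCA'04), Madrid, Spain, 2004, pp. 198-209, doi: 10.1109/HPCA.2004.10011.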
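The wakeup-delay arithmetic from the post (5-2=3) and the replay condition can be written down as a tiny Python sketch. The function names and the idea of measuring everything from the cycle the producer enters the execution unit are assumptions made for illustration; this is not taken from the patent.

```python
def wakeup_cycle(producer_exec_start, producer_latency, wakeup_to_exec):
    """Cycle at which the dependent instruction should be woken up.

    producer_exec_start: cycle the producer enters the execution unit
    producer_latency:    cycles until its result appears on the bypass bus
    wakeup_to_exec:      pipeline stages between wakeup and execution
    """
    return producer_exec_start + max(0, producer_latency - wakeup_to_exec)


def needs_replay(predicted_latency, actual_latency):
    """Dependents were scheduled for the predicted latency (e.g. an L1 hit).
    If the load turns out slower, they read stale bypass data and must reissue."""
    return actual_latency > predicted_latency
```

For a producer that enters execution at cycle 10 with latency 5 and a 2-stage wakeup-to-execute path, the consumer is woken at cycle 13 and reaches the execution unit at cycle 15, exactly when the result is available.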

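On Apple's ROB design
Some of you may find the published numbers for Apple's ROB confusing. To summarize, there is AnandTech's 630 slots, dougall's ~2300 slots measured with nops, and some sources that say ~330 slots. So which of these is right? To answer that, we first have to look at how Apple's ROB is designed. In Firestorm/Avalanche, Apple uses a design with 7 slots per ROB row, and the smallest unit released per cycle is one row. However, every instruction that can cause the processor to flush the pipeline (for example branches (branch misprediction), and stores/loads (a load dependence misprediction that forces the instructions after the load to re-execute, or an exception on the accessed address)) must occupy the last slot of its row. Because of these special instructions, real code often cannot reach the size that ROB size tests report. For example:
  add xn, xm
  mov xn, xm
  store [add], xn
Although there are only 3 instructions here, they take up 7 ROB slots, i.e. a whole row. The store effectively acts as a delimiter. In the worst case, if every instruction takes up a row (for example back-to-back loads), the ROB size is limited to 320 (Firestorm).
Next, let's talk about the History File, also called the PRRT (Physical register reclaim table); both names refer to the same thing. As we know, modern out-of-order processors use register renaming to remove false dependences such as WAW and WAR while preserving true RAW dependences. In Apple's implementation there is a mapping table from logical registers to physical registers. Whenever a new instruction reaches the rename stage, it looks up the mapping table to replace the logical registers of its source operands with physical registers, then requests a new physical register for its destination operand and updates the mapping from that logical register to the physical register. But if the processor later hits an exception or a misprediction, it has to flush the pipeline and return to the state at the point of the exception. Because the processor executes out of order, the mapping table changes in every rename cycle. So how do we get back to the mapping table state of the cycle in which the exception occurred? Clearly we need to record its state. Apple's approach is to record the change each instruction makes to the mapping table (saving the physical register that the logical register mapped to before this instruction was renamed). When an exception occurs, execution of the newest instructions is paused, and starting from the previous mapping recorded by the newest instruction, we walk backward, restoring the mapping of the preceding instruction and freeing the current instruction's physical register, all the way back to the instruction that raised the exception. Here is an example:
  Inst1: X1 <- P4, Old P1
  Inst2: X2 <- P5, Old P2
  Inst3: X3 <- P6, Old P3
  Inst4: X1 <- P7, Old P4
Here Xn <- Pn denotes the change the instruction makes to the mapping table, and Old Pn denotes the physical register that logical register Xn mapped to before the instruction executed. Suppose Inst1 raises an exception; we then need to roll back starting from Inst4. Concretely, we free the current instruction's physical register and change the mapping table back to the physical register that this logical register mapped to before the instruction executed. Rolling back step by step like this restores the mapping table to its earlier state, so execution can continue from the point of the exception.
With the theory covered, we know that to implement rollback we have to record the state before each instruction changed the mapping table. That is what the History File is for. In theory it could be a field of each ROB entry, but for the many instructions that do not modify a physical register that would be a waste, so in Apple's implementation it is a separate table (called the History Table in Apple's patents). On the M1 this table has ~630 entries, which matches AnandTech's number. The test method is to use, as the ROB filler, an instruction that does not occupy a physical register but does change the mapping, such as a zero-cycle mov, mov x5, 0. As mentioned above, the smallest unit the ROB releases is 7 slots/1 row. According to Maynard Handley's data, 56 nops can retire per cycle, i.e. 8 rows. Figure source (http://tieba.baidu.com/mo/q/checkurl?url=https%3A%2F%2Fgithub.com%2Fname99-org%2FAArch64-Explore&urlrefer=10bc6402278b423f33bad3203ee924a6 Volume 1). The History Buffer, because it involves changing the mapping table, is released more slowly, at 16 slots/cycle (physical registers are likewise freed at 16/cycle).
Finally, I want to say that modern commercial high-performance processors are heavily optimized at every level, and many of the implementation tricks are not covered in textbooks. To understand them in more depth, read more papers and read the blogs of people who study processor microarchitecture, such as Henry Wong. To quote Maynard Handley: "When you have finished reading Hennessy & Patterson's Quantitative Approach, you should realize that you have only completed 10% of your journey, not 90%." There is a lot more about Apple's processors worth covering, such as how dependences between instructions are handled and how OoO memory accesses are handled, but for reasons of length I will stop here. We can continue another time.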
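The history-file rollback in the Inst1 to Inst4 example above can be sketched in Python. The function names and the plain-list free list are assumptions for illustration, and whether the faulting instruction's own mapping is undone is controlled here by a `keep` parameter because the post does not spell it out.

```python
def rename(mapping, history, free_list, dest):
    """Give `dest` a fresh physical register and log the old mapping."""
    new = free_list.pop(0)
    history.append((dest, new, mapping[dest]))  # (logical, new phys, old phys)
    mapping[dest] = new


def rollback(mapping, history, free_list, keep):
    """Undo renames from youngest to oldest until `keep` entries remain."""
    while len(history) > keep:
        dest, new, old = history.pop()
        mapping[dest] = old        # restore the previous mapping
        free_list.insert(0, new)   # release this instruction's physical register


mapping = {"X1": "P1", "X2": "P2", "X3": "P3"}
free_list = ["P4", "P5", "P6", "P7"]
history = []
for dest in ["X1", "X2", "X3", "X1"]:   # Inst1..Inst4 from the post
    rename(mapping, history, free_list, dest)
```

After the four renames the table maps X1 to P7; calling `rollback(mapping, history, free_list, keep=0)` walks back from Inst4 to Inst1 and restores X1 to P1, X2 to P2, and X3 to P3.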

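Maynard Handley's ROB size test method
I have not looked into microarchitecture for a while; with the A17 about to launch, now is a good time. In his M1 reverse-engineering document, Maynard Handley measures the ROB, the Int/FP physical register files, and the History Queue with a method different from Henry Wong's. Because the ROB is a FIFO and entries are only released when the oldest instruction retires, we use one or more long-latency dependence chains (he uses FSQRT) to block the ROB head and then pile up nops behind it. Once enough of them accumulate, the ROB fills up and the pipeline stalls; later instructions have to wait until the delay chain ahead retires before they can proceed. The ROB size is determined by observing the behavior once the ROB is full. In the figure, the x-axis is the number of filler nops and the y-axis is the cycles taken. The blue line is the case with the delay chain, and the yellow line is the nop throughput without a delay chain. When the number of filler instructions reaches ~2200, the execution time clearly goes up. Why? Let's first assume the ROB has no capacity limit. Then the cycle count should only start to grow at nop=32*13*8=3328, because for nop<3328 the minimum execution time is set by the FSQRT latency, and for nop>3328 it is set by the nops. In other words, executing the nops takes longer than executing the fsqrts, so when the fsqrts finish the nops are still running. In the figure below, the red line is the theoretical case of a ROB with no size limit. So what happens at nop=2200 that departs from theory? Suppose the ROB size is 2200. If the number of filler instructions is larger than 2200, say 2300, then none of the nops after the first 2200 can execute, because the ROB is full. They have to wait until the ROB head, i.e. the delay chain (FSQRT), retires and frees ROB entries. So the time to execute 2300 nops = the time to execute the delay chain + the time to execute the remaining 100 nops. By contrast, if the ROB were larger than 2300, the execution time = the time to execute the delay chain (the red part of the figure), because if the machine does not stall, the nops execute in parallel with the FSQRTs; the nops finish before the FSQRTs do, so the execution time equals the FSQRT execution time. In the end, the ROB size is determined by observing where the cycle count suddenly rises and departs from the theoretical case. For a more precise observation you can reduce the number of delay chain instructions and find the smallest number of delay instructions for which the effect can still be reproduced, as in the figure below. Finally, here is pseudocode for the test sequence:
  start:
    FSQRT r2, r2, r0
    FSQRT r2, r2, r0
    ...
    FSQRT r2, r2, r0
    nop
    nop
    ....
    nop
    counter--
    branch start if not zero
The loop is there to reduce the impact of operating system context switches.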
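The reasoning above can be turned into a small analytical model in Python. It is only a sketch under stated assumptions: the post's 32*13*8 is read as a delay chain of 32*13=416 cycles and a nop throughput of 8 per cycle, the delay-chain instructions themselves are not counted against the ROB, and `cycles` and `estimate_rob` are made-up names.

```python
def cycles(nops, rob_size, delay_cycles=416, nops_per_cycle=8):
    """Expected cycles for one iteration of the delay chain plus `nops` fillers."""
    # Nops that do not fit in the ROB can only start after the chain retires.
    overflow = max(0, nops - rob_size)
    blocked = delay_cycles + overflow / nops_per_cycle
    # With no stall, the time is bounded by pure nop throughput.
    return max(nops / nops_per_cycle, blocked)


def estimate_rob(measure, max_nops, delay_cycles=416, nops_per_cycle=8):
    """Largest nop count whose measured time still matches the unlimited-ROB curve."""
    for n in range(1, max_nops + 1):
        ideal = max(delay_cycles, n / nops_per_cycle)
        if measure(n) > ideal:
            return n - 1
    return None
```

With a 2200-entry ROB the model gives 416 cycles at 2200 nops and 428.5 cycles at 2300 nops, while an unlimited ROB stays at 416 until 3328 nops. On real hardware `measure` would be replaced by a cycle-counter reading, and the knee would be noisier than in this idealised curve.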

A paper recommendation for anyone who wants to learn about CPU microarchitecture

I got GB5 running on Apple's efficiency cores. A15 single-core: 603. The multi-core score is rather low because there are 2 extra threads.

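A small iOS tool for reading the frequency & power of a running app
Power is measured with the method from my earlier post, and frequency with the same approach as before: a dylib injected into the IPA. It can compute the frequency of the E and P core clusters separately, using an Apple private API (the recount subsystem). Performance impact: in my GB5 tests there is no impact on single-core and <5% on multi-core. The actual impact depends on the sampling rate, the number of threads, and how often the thread list is refreshed. By default it samples once per second and refreshes the thread list every 5 seconds; GB5 has around 10 threads. I will put the download link in the replies below. No Mac is needed to use it.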

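iOS Spec2017: added 521.wrf, fixed memory leak in the int benchmarks
521 is finally working too. In this version, a one-click run of the int benchmarks needs only 1.3G of memory, so any device with more than 3G can run it. All 10 Int and 13 Fp benchmarks are now done. Link in the replies below, testing is welcome~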

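iOS spec2017: added 12 FP benchmarks
Thanks to @jht5132 for providing a machine to compile the Fortran benchmarks. The compiler is flang17/flang16. 521.wrf_r will be added a little later. With this, iOS can run all SPECrate2017 benchmarks, entirely with the llvm toolchain. Download link in the replies below.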

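iOS spec2017 update: added power logging
This also fixes the run order problem with select all. Power logging uses my earlier method. Also, if you have less than 4G of memory, I suggest splitting the run in two, quitting and restarting the app in between.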

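On comparing benchmarks across operating systems (spec)
First, to be clear: everything I say below counts user time only, so syscalls that enter the kernel, OS page faults, and interrupts are not included. Then, as long as we pin the thread to the same core, so that the cache contents do not migrate, and statically link the same libraries (libc/++), in theory the same processor will produce scores with no difference at all. If you insist that context switches and other threads have an effect, you can lock all other threads onto other cores and then flush the cache, or just run bare metal, where there is no operating system at all and no context switching whatsoever. @happy燕十三 @卧楼听松 @jht5132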

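iOS Spec2017: added the 502.gcc benchmark and fixed the clock issue
It has been a while since the last update. With 502gcc added in this update, all int benchmarks can now run on iOS with the clang toolchain. In the replies below I explain why it took so long to add 502gcc. I may later add in-core CPU power and per-benchmark IPC.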

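A very accurate CPU power measurement method on iOS
A few days ago, while reading the XNU source, I came across a function that can call CLPC directly (patent link in the replies below) to record the (in-core) CPU power of a given thread/process, in units of nanojoules. Picture first, with the explanation of how it works in the replies below (the picture shows a GeekBench6 run on the A15, sampled once per second). [image]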

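macOS core pinning tool
I wrote a kext that can pin to a core (it also raises the pinned core's frequency to the maximum, and toggling low power mode has no effect on it). Since it runs in kernel space, do not come to me about lost files or OS panics. Use at your own risk. You also need to disable SIP and enable the Reduce Security option. Tested successfully on macOS13.3.1 on an M1max (T6001). In theory M1/M1pro and M2/M2pro/max work as well. Thanks to @Jht5132 for helping with testing. Link in the replies below.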

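A brief look at how the graphics pipeline of modern GPUs is implemented in hardware
Purely amateur; this is mainly about the rasterization graphics pipeline. Before talking about shaders, we need to cover the concepts of SIMD, SIMT, and SISD. SISD (Single Instruction Single Data) is what scalar CPUs use. That is also why modern (single-core) CPU performance focuses on executing a single instruction stream more efficiently (ILP), with techniques such as branch prediction, out-of-order execution, superscalar execution... Next, SIMD (Single Instruction Multiple Data) is also implemented on modern CPUs; its instructions execute mixed in with SISD ones, and vector computation is a typical SIMD use case. Then SIMT (Single Instruction Multiple Thread): it is a lot like SIMD, but each thread can have different control flow. You can think of it as the same instruction stream being executed by many different threads. Continued in the replies below.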

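Apple's CPUs really are not as good as x86, stop hyping them
I keep seeing people hype how amazing Apple's CPUs are, and it really puts me off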

Spec17 iOS app source code
If you want to build it yourself, you need to prepare the spec input data yourself. Link in the replies below.

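A brief look at Apple's schedulers and dispatch queues, and how to test them
(Figure source: Dougall) See the parts I circled in red in the figure. First, a scheduler and a reservation station are the same thing, so normally you would not need a dispatch queue. We first need to be clear about how the two differ. After rename, an instruction moves from the front end into a dispatch queue. If the scheduler has a free entry at that moment, the instruction goes straight into the scheduler. If there is no room, it stays in the dispatch queue until the scheduler has space, and only then enters the scheduler, waits for its operand(s), and executes. If the scheduler is full, the instruction waits in the dispatch queue, and as long as it is still in the dispatch queue it cannot execute even if all of its operands have arrived. In short, instructions in the dispatch queue cannot execute; they have to be in a scheduler to execute. So what is the benefit of this? In the figure you can see that each scheduler can only accept 1 uop/cycle (except the memory unit schedulers). Suppose the front end decodes & renames more than 6 add instructions at once; there are only 6 schedulers in total, so this would stall the front end. This is why the dispatch queues are needed. Every dispatch queue can accept 8 uops/cycle, so it can hold the extra uops that the schedulers cannot accept, which then wait to execute in the next cycle (assuming their operand(s) are available). This way the front end does not stall. Finally, thanks to dougall for helping answer my questions. The test method continues in the replies below.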
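The "8 adds in one cycle" example above can be worked through with a one-function Python sketch. It is a deliberately crude single-cycle model: the name `dispatch`, the idea of a number of free dispatch-queue entries, and the assumption that every scheduler has a free entry are all made up for illustration; the 6 schedulers and 1 uop/cycle per scheduler come from the post.

```python
def dispatch(uops, queue_free_entries, num_schedulers=6, uops_per_scheduler=1):
    """Split one cycle's renamed uops into (to schedulers, held in queue, stalled)."""
    to_schedulers = min(uops, num_schedulers * uops_per_scheduler)
    queued = min(uops - to_schedulers, queue_free_entries)
    stalled = uops - to_schedulers - queued   # these block the front end
    return to_schedulers, queued, stalled
```

Without a dispatch queue, `dispatch(8, 0)` leaves 2 uops stalled in the front end; with room in the queue, `dispatch(8, 8)` parks those 2 uops so that rename can keep going.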

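Binding threads to cores on macOS with a Kernel Extension
What is it for: running core-to-core latency tests on machines that Asahi Linux does not support yet, and binding certain tasks, such as benchmarks, to the efficiency cores.
How it works: in kernel space, set the bound_processor field of the thread structure to processor_array[cpuid]. The relevant Apple XNU kernel file is http://tieba.baidu.com/mo/q/checkurl?url=https%3A%2F%2Fgithub.com%2Fapple-oss-distributions%2Fxnu%2Fblob%2F5c2921b07a2480ab43ec66f5b9e41cb872bc554f%2Fosfmk%2Fkern%2Fsched_prim.c&urlrefer=89ecb390f22f90bdc76c63a285da313b with the functions thread_bind_internal(), thread_bind(), and sysctl_thread_bind_cpuid().
How to load a kext: see the documentation on Apple's developer website.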

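On the M1's load/store queue optimization
You are probably all familiar with the load/store queue, a very important component of modern high-performance processors. Normally, a store can only write to memory after it has passed the commit stage (to guarantee precise exceptions). But Apple can write later stores to memory while earlier instructions have not yet committed. You may ask: doesn't that violate precise exceptions? It would, but Apple only writes to memory early when the earlier instructions cannot possibly raise an exception. The benefit is that store buffer entries are freed as quickly as possible.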

Intel co-founder Gordon Moore has passed away. RIP

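Apple Watch Series 9 (S9 SoC) will use the A16 E-core
I spotted this by chance today while reading Xcode header files. After three generations of SoC, S6, S7, and S8, Apple has finally updated it. It should be the strongest watch SoC on the planet when it arrives. @Jht5132 @A12Bionic @NPacific @卧楼听松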

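Calling Apple's AMX unit directly on iOS
It ran successfully on the A15. The surprising part is that Apple has not locked it down on iOS, which is not like Apple at all.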

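iOS spec2017 update: added Fp benchmarks
At one point I planned to stop updating because hardly anyone was testing it, but after thinking it over I decided to keep going. For the efficiency-core tests I am really out of ideas: no matter which API I call, I cannot stop the thread from hopping around between the efficiency cores. If anyone knows how to solve this, please tell me. I have tried 2 methods. The first is the one from my previous post, v1.1.1. The second, described here, is setting the QoS to Utility. Why not background? Because with background the hopping is so severe that the core never gets fully loaded. The advantage of this method is that the thread never jumps to a performance core, but there is more hopping between cores than with v1.1.1, so the scores come out lower.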

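iOS Spec2017 App: added E-core testing
This also fixes the earlier bug where select all would end up running on the efficiency cores. How it works: I use 2 threads with QoS user interactive to occupy the performance cores, and then run the benchmark on a thread whose QoS is also user interactive but whose thread priority is -5. The time the benchmark thread spends on the performance cores is >1%, so the core counts as just about fully loaded, and there is relatively little switching between cores.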

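spec2017 iOS App release
It can currently run all int benchmarks except 502_gcc. Compilers: C/C++: Apple clang version 14.0.0 (clang-1400.0.29.202), Fortran: Flang-new version 17.0.0. The flags are all -Ofast -arch arm64. The UI is written in SwiftUI.
Limitations
1. To quit during a run you have to quit the app manually
2. The size is 972MB, because I decoded the x264 input ahead of time before bundling it
3. The minimum OS version is iOS 16, because the UI uses some iOS 16 features
IPA download links
Google Drive: http://tieba.baidu.com/mo/q/checkurl?url=https%3A%2F%2Fdrive.google.com%2Ffile%2Fd%2F1jzPYuH2GYg8LTSQsGelYzPA6ufS0b7wD%2Fview%3Fusp%3DsharingOne&urlrefer=227ef391fb489c4b6af79706fe06b448
OneDrive: https://1drv.ms/u/s!AhBf1NYB2tFIc3-tBwRWmKJjbkE?e=hYoFWg
How to install
Apple's developer account is too expensive for me, so the app has to be sideloaded. If you have a computer, you can install it with altstore, sideloadly etc... If you do not have a computer, you can currently install it with Scarlet (https://usescarlet.com).
Thanks to @Jht5132 for providing a machine for compiling and for the feedback. That is all I can think of for now. If you find a bu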

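Running the spec2017 Fortran benchmark 548 on iOS
First, thanks to @jht5132 for providing a machine and helping with compilation. The compiler is llvm flang17 (http://tieba.baidu.com/mo/q/checkurl?url=https%3A%2F%2Fgithub.com%2Fllvm%2Fllvm-project%2Ftree%2Fmain%2Fflang-Ofast&urlrefer=c7921d109bec8cecdd0d0e1ef723197e), flag -Ofast. Ref score at room temperature: 8.4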

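Running spec2017 on iOS
I ran spec2017 505.mcf ref -O3 on an iPhone 13 Pro: 235 seconds, which converts to a score of 6.87. If I have time later, I will turn this into an app so that iOS can run spec17 too. @Jht5132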

What are caches for?

How processor pipeline depth affects frequency and IPC/CPI

What is the ROB (reorder buffer) for? As the title says.

Why RISC-V and Arm cannot replace x86

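On efficiency-core scheduling on the M1 Max/Pro
The M1 has 4 efficiency cores, while the M1 Pro/Max only have 2. If a set of processes all run on the efficiency cores, wouldn't the M1 then be faster than the M1 Pro/Max? Obviously Apple does not allow that. If you look at the M1, its efficiency cores top out at only 1G. On the M1 Pro/Max it is the same when there is only one thread: the efficiency cores run at 1G. But if the thread count is >=2, Apple raises the efficiency-core frequency to 2.06G. This keeps the M1 Pro/Max from being slower than the M1 on efficiency-core workloads. It costs power, which is also why the low-power part of the M1 Pro/Max efficiency curve is worse than the M1's.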

13th gen feels pretty meh this time: it is just an overclock, the microarchitecture is basically unchanged

ADL vs Zen4 vs RTL spec17 scores, as estimated by Andrei based on Raichu's figures

On the new benchmark, gpu score
I had a look at their white paper, which says: "Alternative benchmarks seem to have at least 6 or even 10 years old tests included in their latest releases. We will not mention any brands here to avoid unnecessary discrediting." I don't need to spell out who they mean, do I?

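Geekerwan's move here is really something, every bit as bad as that 12th gen episode. Comparing against the A16 and coming up with a gap of dozens of times: what is that supposed to prove? Comparing ray tracing performance against a GPU like the M1U's that has no ray tracing? How much did NV pay them for this one?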

The A16 is Armv8.6, not Armv9

A16 die shot

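On automated Spec17 testing on macOS
The front end is more or less done. It is pure C++ using the imgui library. On my worst laptop (I5-5350U), the GUI uses about 0.4%-0.5% CPU. The rendering backend is glfw+opengl3/glfw+metal.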

Longhorn has gone to Qualcomm

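Firestorm VS. Golden Cove VS. Icestorm VS. Gracemont microarchitecture
I posted about this yesterday and nobody read it, maybe because it was hard to follow, so this time I will go into more detail. The data comes from dougallj's M1 reverse-engineering code and Edison-Chan's article.
Maximum out-of-order window (filler instruction is nop): firestorm 2295 VS Golden Cove 512 VS Icestorm 415 VS Gracemont 256
Out-of-order window (filler instruction is mov reg, 0): firestorm 623 VS Golden Cove N.A VS Icestorm 111 VS Gracemont N.A (PS: no Alder lake data for now)
Out-of-order window (first filler instruction is cmp, filler instructions 1-100 are str reg, 101-230 are b.ne, the remaining fillers are mov reg, 0): Firestorm 853
Load buffer size: firestorm 129 VS Golden Cove 192 VS Icestorm ~29 VS Gracemont ~80-82
Store buffer size: firestorm 108 VS Golden Cove 112 VS Icestorm ~36 VS Gracemont 48
Number of general-purpose registers (only the speculative part can be counted; for why, see Henry Wong's article): firestorm 434 VS Golden Cove 240 VS Icestorm 87 VS Gracemont 192
Reservation station size: firestorm 158 VS Golden Cove N.A VS Icestorm 32 VS Gracemont N.A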

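Some data on the firestorm & icestorm microarchitectures
While reading dougallj's M1 reverse-engineering code, I found some firestorm & icestorm microarchitecture data that has not been published online.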

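M1 Max running spec2017 with gcc12.1.0: the scores are simply brutal
The 548 benchmark is nearly 2x higher than AnandTech's run, and x264 is 8 points higher. The compile flag is -Ofast.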

??? I have nothing left to say

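The misconception about RISC (arm) and CISC (x86) instruction counts
I am mainly writing this on behalf of petrification. People usually assume that RISC needs several instructions to do what CISC does in one. Yes, that is what I thought at first, too. These are the instruction counts for spec 2006: for the spec int average, the ARM instruction count is slightly higher than X86, but that is mainly caused by hmmer; among the other benchmarks some have fewer instructions than X86, some about the same, some a few more. The spec fp avg is much lower than X86. µ-ops are the instructions the processor actually executes, i.e. micro-instructions. In a CPU with a micro-op architecture, the machine code and assembly produced by the compiler do not change. But at the decode stage, what the instruction decoder "translates" into is no longer a single CPU instruction. The decoder "translates" one machine instruction into several "micro-instructions". These micro-instructions are no longer CISC-style; they are fixed-length and RISC-style. If you compare µ-ops, ARM has fewer across the board, by a lot. Below we discuss why.
This is what 韦易笑 said: A commenter felt that since CISC has more instructions, its code should be shorter, just as with writing: the more letters a language has, the shorter the text. Intuitively that is how it should be. The problem is that x86's CISC is too crude and cannot express logic in fine enough detail. For example, it can only add a source register into a destination register, so computing x=a+b takes two instructions, whereas arm supports source 1 + source 2 into a separate destination and needs only one. Or take stack operations: x86 has many instructions, but its stack operations are far less flexible than arm's. Many x86 instructions have two operands; many arm instructions have three or four. Here is a concrete example, an approximate color blend:
  unsigned int AlphaBlend(unsigned int c1, unsigned int c2) {
    unsigned char a1 = (c1 >> 24) & 0xff;
    unsigned char r1 = (c1 >> 16) & 0xff;
    unsigned char g1 = (c1 >> 8) & 0xff;
    unsigned char b1 = (c1 & 0xff);
    unsigned char a2 = (c2 >> 24) & 0xff;
    unsigned char r2 = (c2 >> 16) & 0xff;
    unsigned char g2 = (c2 >> 8) & 0xff;
    unsigned char b2 = (c2 & 0xff);
    /* every channel is weighted by the first colour's alpha, a1 */
    r1 = (r1 * a1 + r2 * (255 - a1)) >> 8;
    g1 = (g1 * a1 + g2 * (255 - a1)) >> 8;
    b1 = (b1 * a1 + b2 * (255 - a1)) >> 8;
    a1 = (a1 + a2 * (255 - a1)) >> 8;
    return (a1 << 24) | (r1 << 16) | (g1 << 8) | b1;
  }
Compiled with gcc in both cases, with the -Os optimization option (optimize for code size), the results are:
arm: 32 instructions, 128 bytes in total
x86: 60 instructions, 140 bytes in total
Changing to gcc -O2 gives:
arm: 33 instructions, 132 bytes
x86: 55 instructions, 153 bytes
Many x86 instructions have two operands; many arm instructions have three or four.
The ARM instruction set has become complex since V8. (ARMv4 supports 300 instructions [14]. In ARMv8 [22], this number has grown to more than a thousand. Overall, ARMv8 is complex and has many instruction formats.) Only 25% of CISC instructions are used 95% of the time, which is also why RISC instruction counts can even be lower than CISC. Finally, here is the comparison of X86 and Arm execution cycles on spec2006 and EMBEDDED. Reference: A_Study_on_the_Impact_of_Instruction_Set_Architectures_on_Process.pdf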