junjie1475
AAPL Scheduler design: Smarter = Lower Power & Higher Frequency

Fig 0: Apple's overall scheduling design

Background

Traditionally, the scheduler (reservation station) has been designed as a queue to preserve age information, allowing the oldest instruction to be selected when multiple instructions are ready to issue. However, a queue structure consumes significant power and involves complex wiring: each time a new instruction enters the queue, the queue has to collapse to fill the gaps left by vacated entries (see Fig. 1 below).

Apple's approach instead uses an age matrix to track relative age, as described in their 2015 patent on a non-shifting reservation station (https://patents.google.com/patent/US20170024205A1). The idea is that each scheduler entry carries extra state indicating its age relative to every other entry (1 for older, 0 for younger). For brevity, I will use C syntax, where M[X][Y] denotes the Yth bit of the Xth entry in the matrix. For example, if instruction A enters scheduler entry 0 and no other instructions are present, its state would be M[0] = 11110 (assuming a pool of 5 entries), since it is older than all other, unallocated entries (see Fig. 2 below).

Fig 1: Many writes have to be generated when "collapsing"

This structure is easy to update. Consider what happens to the relative ages between entries when an instruction enters or exits the scheduler:

Fig 2: Each entry has an N-bit field to keep track of relative age, where N is the size of the scheduler

When an instruction first enters the scheduler, you allocate an entry in the array and set its age vector accordingly. Since the dispatch stage is in order, all instructions already inside the scheduler are older than the current instruction, so those bits are set to 0.
For unallocated entries, we set the bits to 1, because whenever those entries do get allocated, the current instruction must be older than them. When an instruction is issued from the scheduler, its entry is freed, and in every allocated entry the bit indicating relative age to the freed entry is set to 1: the next instruction allocated into that entry must be younger than every instruction currently in the scheduler.

Now that we have this extra state information, how do we use it? Apple's 2016 patent on reservation station early age-indicator generation (https://patents.google.com/patent/US10120690B1) describes a scheme that not only consumes less power and is easier to wire, but also yields a shorter critical path and therefore higher frequency. The goal is to find the oldest ready instruction to issue to the execution unit using just the matrix state. This can be pictured as a "knockout round" based on instruction age (see Fig. 3).

Fig 3: Perform knockout rounds between instructions

Using the matrix, an age comparison between any two scheduler entries is straightforward: first, ensure that the entry is valid (i.e., allocated); next, check that its instruction is ready for execution; finally, compare the age fields: if M[X][Y] > 0, then X is older than Y. Apple implemented this process with a few tweaks to shorten the critical path. A modern CPU budgets roughly 20-24 gate delays per cycle, and the tree height is log2(entries), so a 32-entry scheduler needs only 5 comparison passes.

Fig 4: Overview of Apple's approach

Let's make a few assumptions: one instruction is dispatched to the scheduler per cycle (which holds for Apple due to their dispatch-buffer design), and one instruction is dequeued from the scheduler to the execution unit per cycle.
Apple splits the comparison across two pipeline stages. The first pass selects the older instruction from each adjacent pair of entries and stores the winners in registers. In the following cycle, the remaining passes run on the instructions selected in the previous cycle. This creates a potential issue: the selections are based on information from the previous cycle rather than the current one.

To illustrate the problem: in cycle 0, one instruction is selected to execute (it will dequeue in the next cycle), but the first pass does not see this, since the scheduler only updates on the transition to the next cycle. In cycle 1, the later passes can therefore pick the very same instruction again, precisely because the stale view still shows it as the oldest, while the first-pass stage is already selecting from the post-dequeue state. (Note: this is my personal interpretation, not explicitly stated in the 2016 patent.)

To avoid this repetition, the previously selected entries are calibrated with current-cycle information. In the second pipeline stage, we know an instruction has just dequeued from the scheduler, so we simply remove it from the selected list, keeping the selection up to date. As for instructions newly entering the scheduler: a new instruction must be the youngest of all instructions present, so the only situation in which it should win is when nothing was selected by the previous pass. An easy yet elegant solution, isn't it?

Taking a step back, this is essentially incremental adjustment: since at most two instructions are inserted into or removed from the array in any given cycle, it is feasible to adjust the previous cycle's selection with minimal work, avoiding delays.
This is far superior to redoing the entire selection from scratch, which would lengthen the critical path. Thanks to Maynard Handley for pointing out these patents in his M1 exploration PDF. If you haven't read it, you definitely should.
Apple M4's load queue

While testing the M4's ld/st queues a few days ago, I found behavior that differs from the A17, so here is a post about it.

The test probe is:

Cache-miss load (first dependent chain)
Independent loads
...
Cache-miss load (second dependent chain)

On the A17, the cycle-count spike in this test appears at around 100 independent loads. On the M4 that number is abnormally high, so I believe the latency at that point is limited by another structure (the ROB) rather than the load queue. My guess was that Apple releases load queue entries early when there are no older, not-yet-executed stores.

A further probe:

Cache-miss load (first dependent chain)
Store whose data depends on the address of the first load
Independent loads
...
Cache-miss load (second dependent chain)

Surprisingly, this probe produced the same number as the previous one. Apple's ld/st queue entries are allocated only at execution time, which means that even with the store still sitting in the scheduler, as long as its memory address is known, load queue entries can still be released early.

The third probe is like the second, except the store's memory address depends on the previous load. This time the number returned to a reasonable range. The reason: if a store's address is not yet known, load bypassing may mispredict. Once the store obtains its address, it must check the load queue for loads that depend on this store but already got their data from the cache; any such load, and every instruction after it, must be re-executed. If load queue entries were released early in this situation, later stores could no longer check whether those loads need to be re-executed.
On GPGPU core sharing and preemption

First, terminology:
warp: similar to a SIMD thread on a CPU
threadblock: a collection of warps guaranteed to run on the same SM
SM: the minimum unit of execution
GPU core: a collection of SMs (assumed to be 4 here)

Current GPU core designs have all converged on putting 4 SMs inside a GPU core. The SMs have independent schedulers and execution units, but they all share the same local data cache. At a high level, the top-level scheduler breaks a kernel down into multiple thread blocks, and each thread block is guaranteed to be scheduled onto a single GPU core. Once a threadblock is emitted into a GPU core, it is up to the core scheduler to decide how to distribute its warps across the SMs. Nowadays, warps from different thread blocks of different kernels can all be scheduled onto the same GPU core to achieve maximum utilization.

What the core scheduler does, then, is dispatch as many warps as possible to the SMs (to hide latency). That said, the number of warps that can be scheduled onto an SM depends on the resources shared by different threads, e.g. register file size, branch convergence stack, etc. Here is an example: we have 2 thread blocks, each with a warp size of 32, dispatched to the core scheduler. Each cycle, the core scheduler checks how many threads are executing in the SMs; once there is space (register space, convergence stack, and whatever other state is required to run the threads) and enough room in local shared memory, the warps are dispatched.

Inside the SM there is one more scheduler that actively chooses a warp to execute from the active set of warps (this active set is contained in the larger set of running warps); each cycle one warp is dispatched and decoded. The instructions then reach another, smaller queue, where yet another scheduler checks each warp instruction for dependences every cycle and tries to find one to issue into the execution pipeline.
In the end, once a warp has finished execution, its registers and branch stack are released; this allows another warp to enter execution in its place.

Preemption

Now we allow concurrent execution of warps from different kernels (or even different processes) within a GPU core, but we still lack time sharing. For example, imagine you have a long-running background compute task but you still want to refresh the display every 1/60 s. You can preempt at the thread-block level: move the render kernel's thread blocks to the front of the core scheduler's waitlist, and once the preceding thread blocks have finished executing, dispatch warps from the render thread blocks rather than the compute ones. If the threadblocks take too long to execute, you can preempt at the instruction level. This is costly: it requires saving all the current register context and other state, then swapping in the render thread's threadblocks. Once the render threads finish, you restore the register state and resume execution of the previous thread as usual.

Apple gives 64 KB of scratchpad memory to each GPU core, and a thread block can use up to 32 KB of it. I think this design exists so that in any cycle the core can keep at least two threadblocks active (if other resources are not constrained): when execution reaches the boundary of one thread block, the core scheduler can start dispatching threads from the next thread block whenever there are free slots in the SMs.

So basically things happen like this: there is a waitlist inside each GPU core, and the top-level scheduler dispatches threadblocks to each GPU core. Every cycle the core scheduler checks whether it can dispatch thread blocks from the waitlist (i.e., it checks scratchpad usage and free slots in the SMs). If all resources are reserved successfully, it dispatches the warps into the SMs; otherwise it continues waiting for free slots.
Out-of-order execution of CPU loads/stores

Let me talk about out-of-order execution of load/store instructions and its relationship with the load/store queues. Around the mid-to-late 1990s, high-performance CPUs began executing load/store instructions out of order:
- A load executes as soon as its memory address is available.
- A store executes as soon as it has both its memory address and the data to store.

This creates several problems:
- Executing a store changes memory state. If a branch misprediction or exception later forces a rollback and re-execution, the memory changes made by the store cannot be undone.
- When a load executes, stores before it (in program order) may not have executed yet — so how does it get its data?

The first problem is solved with the store queue. While stores are still in order (at rename in the front end), each is allocated a store queue entry (holding the memory address, the store data, an age tag, and some other flags). When the store executes, it does not touch memory; instead, it writes its data and address into the store queue. Only when the store reaches the head of the ROB is the data written to memory.

The second problem is solved with the load queue. While loads are still in order (at rename), each is allocated a load queue entry (memory address, age tag, and some flags). When a load obtains its address and executes, there are two cases:

1) All stores older than the load have already executed (checked via the age tags). Then simply compare the load's address against the store queue: on a match, take the data straight from the store queue; otherwise, read from the L1 cache.

2) Some older stores have not yet executed. There are again two sub-cases: a) the store queue holds a matching address; b) no address matches — but you cannot be sure, because the older stores haven't executed. So you predict whether the current load depends on any store in the store queue. If predicted independent, read the L1 cache; if predicted dependent, keep waiting in the RS for re-issue (this interacts with store forwarding: when a store executes, it searches the load queue for loads with a matching address and, if found, hands them its data). Since this is a prediction, it can be wrong, so at store retire you must check the load queue for loads that match the store's address but got their data from the L1 cache rather than the store queue; any such load, and everything after it, must be re-executed.

Summary

Load queue
- searched by store forwarding to hand data to dependent loads
- used to detect when the dependence prediction failed

Store queue
- enables precise exceptions
- lets loads bypass older stores

Real implementations are far more complicated than the above (e.g. handling loads/stores of different sizes), but going any further here would get messy.
On Apple's speculative scheduling and instruction replay

In the late 1990s, high-performance out-of-order processors such as the Alpha 21264 and Intel Pentium 4 adopted speculative scheduling. Let me explain what that means.

Per Tomasulo's algorithm [1], each instruction, after rename, waits in the reservation station (RS) for its source operands. Once its source operands are ready, the instruction is woken up, and the scheduler selects one instruction, based on the available information, to send to an execution unit. After an instruction finishes executing, it broadcasts its tag; RS entries matching the tag are then woken up, selected, read their source operands, and enter execution, and so on.

In modern processors, however, to reach higher frequencies, the path from wakeup to actually entering the execution unit is split into several pipeline stages (see Fig. 1). If we waited until an instruction finished executing before broadcasting its tag, then selected and read source operands, there would be a gap between two dependent instructions: after the first completes, several cycles are spent on wakeup, select, and register read before the second actually starts executing. This causes pipeline bubbles and lowers IPC.

Fig 1 [2]: Wakeup to execution is split into multiple stages

The solution is to move tag broadcast earlier, from completion to the select stage. At select, the instruction's dependents are told that their source operand will be produced within the next few cycles, so they can issue early and, by the time they reach register read/bypass, pick up exactly the value the parent just finished producing (Fig. 2). For some short-latency instructions we need not only the parent's tag but also the tags of the parent's own parents [2]: otherwise, even with the earlier broadcast, by the time the scheduler matches tags and selects for issue, the parent has already finished executing, and we still lose a cycle before entering the execution unit (Fig. 3).

Fig 2 [3]: Back-to-back execution of two dependent instructions
Fig 3 [2]: A LOAD depending on a low-latency ADD starts one cycle late

To ensure an instruction reads its dependent source operand at exactly the right time, we record each operand's production latency. Say a parent's operand is produced 5 cycles after it enters execution, and the wakeup-to-execution pipeline takes 2 cycles; then we simply delay waking the dependent by 5 - 2 = 3 cycles, so that when it reaches the bypass bus it reads the parent's freshly completed result (Fig. 4).

Fig 4 [3]: The second instruction depends on the first LOAD, whose latency (assuming an L1 hit) is 3 cycles

A problem arises, though: some instructions have variable latency. A LOAD missing the L1 cache or the TLB, or discovering at execution that the store data it depends on has not appeared, all delay the LOAD's completion, so without countermeasures its dependents would read wrong data off the bypass bus. One option is to flush and re-execute everything after a LOAD whose latency exceeded the prediction; the cost of that is obviously enormous. So today's high-performance processors all use instruction replay: we record every instruction dependent on the LOAD, and if the LOAD's latency misses expectations, all of those dependents re-execute. Re-execute here means re-issue, not re-fetch/re-decode, and it applies only to the LOAD's dependents. For Apple's implementation, see their patent [4].

Apple's current replay implementation is fairly simple: all instructions dependent on a load are kept in the issue queue until verification, at which point they are either released or replayed. For a more thorough analysis, see the paper by I. Kim and M. H. Lipasti [5], which surveys several implementations: non-selective replay, where every instruction woken up, selected, and issued between the load's issue and its verification is re-executed (the early Alpha 21264 used a similar design), and selective replay, which replays only instructions directly or indirectly dependent on the load and itself has several variants.

References
[1] Tomasulo, Robert M. "An efficient algorithm for exploiting multiple arithmetic units." IBM Journal of Research and Development 11.1 (1967): 25-33.
[2] Stark, Jared, Mary D. Brown, and Yale N. Patt. "On pipelining dynamic instruction scheduling logic."
Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture. 2000.
[3] Perais, Arthur, et al. "Cost-effective speculative scheduling in high performance processors." ACM SIGARCH Computer Architecture News 43.3S (2015): 247-259.
[4] Apple Inc. US10514925B1 - Load speculation recovery (2016).
[5] I. Kim and M. H. Lipasti, "Understanding scheduling replay schemes," 10th International Symposium on High Performance Computer Architecture (HPCA'04), Madrid, Spain, 2004, pp. 198-209, doi: 10.1109/HPCA.2004.10011.
On Apple's ROB design

Some readers may be fuzzy about the numbers for Apple's ROB. To summarize the figures in circulation: AnandTech's ~630 slots, Dougall's nop-based measurement of ~2300 slots, and some write-ups saying ~330 slots. Which of these is right? To answer that, we first need to look at Apple's ROB design.

In Firestorm/Avalanche, Apple used a design with 7 slots per ROB row, where the minimum release unit per cycle is one row. However, any instruction that can cause a pipeline flush (for example a branch (branch misprediction), or a store/load (a load-dependence misprediction forcing everything after the load to re-execute, or a faulting access)) must occupy the last slot of a row. Because of these special instructions, real code rarely reaches the size that ROB-capacity microbenchmarks measure. An example:

add xn, xm
mov xn, xm
store [add], xn

Only 3 instructions, yet they occupy 7 ROB slots, i.e. a full row; the store effectively acts as a separator. In the worst case, if every instruction occupies a row (e.g. back-to-back loads), the ROB is limited to 320 (Firestorm).

Next, let's talk about the History File, also called the PRRT (physical register reclaim table) — different names for the same thing. As we know, modern out-of-order processors use register renaming to eliminate WAW and WAR anti-dependences while preserving RAW true dependences. In Apple's implementation there is a mapping table from logical to physical registers. Whenever a new instruction reaches the rename stage, its logical source registers are replaced by physical registers via a mapping table lookup, a new physical register is allocated for its destination operand, and the corresponding logical-to-physical mapping is updated. But if the processor later hits an exception or misprediction, it must flush the pipeline and return to the state at the point of the fault. Since execution is out of order and the mapping table changes every rename cycle, how do we get back to the mapping table as it was in that cycle? Clearly we need to record its state. Apple's approach is to record the change each instruction makes to the mapping table (saving the physical register that the instruction's destination logical register mapped to before its rename). On an exception, the newest instructions are stalled, and we walk backwards from the newest instruction's recorded mapping, restoring each previous mapping and freeing each instruction's physical register, until we reach the faulting instruction. An example:

Inst1: X1 <- P4, old P1
Inst2: X2 <- P5, old P2
Inst3: X3 <- P6, old P3
Inst4: X1 <- P7, old P4

Here Xn <- Pn is the change the instruction makes to the mapping table, and "old Pn" is the physical register that Xn mapped to before the instruction was renamed. Suppose Inst1 faults: we roll back starting from Inst4, at each step freeing the instruction's physical register and restoring the mapping table entry to the physical register that logical register mapped to before the instruction. Walking back this way restores the mapping table to its earlier state, so execution can resume from the faulting point.

With the theory in hand: to support rollback we must record, for each instruction, the mapping-table state it overwrote, and that is exactly what the History File is for. In theory it could be a field of each ROB entry, but for the many instructions that never modify the mapping that would be wasted space, so in Apple's implementation it is a separate table (called the History Table in their patents). On the M1 this table holds ~630 entries, which matches AnandTech's number. The way to measure it is to use, as the ROB filler, an instruction that consumes no physical register but does change the mapping — e.g. a zero-cycle mov: mov x5, 0.

We said earlier that the ROB's minimum release unit is one 7-slot row; per Maynard Handley's data, 56 nops can retire per cycle, i.e. 8 rows. (Source: https://github.com/name99-org/AArch64-Explore, Volume 1.)

The History File, since it involves changing the mapping table, releases more slowly, at 16 slots/cycle (physical registers are likewise freed at 16/cycle).

Finally, I want to say that modern commercial high-performance processors are heavily optimized at every level, and textbooks never teach many of the implementation tricks. To go deeper, read papers, and read the blogs of people like Henry Wong who study processor microarchitecture. To quote Maynard Handley: "When you finish Hennessy & Patterson's quantitative approach, realize that you have completed 10% of your journey, not 90%." There is much more worth covering about Apple's processors — for example how dependences between instructions are handled, and how out-of-order memory accesses are handled — but space is limited, so I'll stop here and continue another time.
Maynard Handley's ROB size test methodology

I haven't studied microarchitecture for a while; with the A17 about to come out, it's a good time to dig in again. In his M1 reverse-engineering document, Maynard Handley tests the ROB, the int/FP physical register file sizes, and the history queue with a method different from Henry Wong's.

Since the ROB is a FIFO and entries are released only when the oldest instruction retires, we block the ROB head with one or more long-latency dependence chains (he uses FSQRT) and then pile nop instructions behind them. Once enough nops accumulate, the ROB fills and the pipeline stalls: later instructions must wait until the delay-chain instructions retire before they can proceed. By observing the behavior after the ROB fills, we can determine its size.

In the figure, the x axis is the number of filler nops and the y axis is the cycles consumed. The blue line is the case with the delay chain; the yellow line is plain nop throughput without it. When the filler count reaches ~2200, execution time clearly lengthens. Why? Suppose first that the ROB had no capacity limit: execution time should only start growing at nop = 32*13*8 = 3328, because below 3328 nops the minimum execution time is determined by the FSQRT latency, and above 3328 it is determined by the nops — in other words, only past that point does executing the nops take longer than executing the fsqrts, so the nops are still running when the fsqrts finish. The red line in the figure shows this theoretical unlimited-ROB case.

So what happens at nop = 2200 to break the theory? Suppose the ROB size is 2200. If the filler count exceeds 2200 — say 2300 — then the nops beyond the 2200th cannot execute, because the ROB is already full; they must wait for the ROB head, i.e. the delay-chain (fsqrt) instructions, to retire and free ROB entries. So the time to execute 2300 nops = the delay-chain execution time + the time to execute the remaining 100 nops. By contrast, if the ROB were larger than 2300, the execution time = the delay-chain execution time alone (the red part of the figure): with no stall, the nops execute in parallel under the FSQRTs and finish first, so total time equals the FSQRT time. The ROB size is thus found by observing where the cycle count suddenly rises above the theoretical curve. For a sharper observation, you can reduce the number of delay-chain instructions and find the minimum count at which the effect still reproduces.

Finally, pseudocode for the test sequence:

start:
FSQRT r2, r2, r0
FSQRT r2, r2, r0
...
FSQRT r2, r2, r0
nop
nop
...
nop
counter--
branch start if not zero

The loop is there to reduce the impact of OS context switches.
On the misconception about RISC (ARM) vs CISC (x86) instruction counts

(I'm mostly ghostwriting this for petrification.) People usually assume that RISC needs several instructions to accomplish what CISC does in one — yes, I used to think so too. Here are the SPEC 2006 instruction counts: on the SPECint average, ARM's instruction count is slightly higher than x86's, though mainly because of hmmer; among the other subtests, some are lower than x86, some about the same, some a bit higher. The SPECfp average is much lower than x86's.

µ-ops are the instructions the processor actually executes. In a micro-op-based CPU, the machine code the compiler emits is unchanged, but at the instruction decode stage, the decoder no longer "translates" a machine instruction into a single CPU operation: one machine instruction is "translated" into several micro-ops, and these micro-ops are no longer CISC-style but fixed-length and RISC-like. If you compare µ-op counts, ARM is lower everywhere — much lower. Let's discuss why.

Quoting 韦易笑: commenters feel that CISC, with its larger instruction set, should produce shorter code — like writing prose in a language with a larger alphabet, it intuitively seems so. The problem is that x86's CISC is too coarse and not expressive enough. For example, x86 can only add a source register into a destination register, so computing x = a + b takes two instructions, while ARM supports src1 + src2 into a separate destination — one instruction. Or take stack operations: x86 has plenty of instructions, yet its stack handling is far less flexible than ARM's. Many x86 instructions are two-operand, while many ARM instructions take three or four operands.

A concrete example — approximate alpha blending:

unsigned int AlphaBlend(unsigned int c1, unsigned int c2)
{
    unsigned char a1 = (c1 >> 24) & 0xff;
    unsigned char r1 = (c1 >> 16) & 0xff;
    unsigned char g1 = (c1 >> 8) & 0xff;
    unsigned char b1 = (c1 & 0xff);
    unsigned char a2 = (c2 >> 24) & 0xff;
    unsigned char r2 = (c2 >> 16) & 0xff;
    unsigned char g2 = (c2 >> 8) & 0xff;
    unsigned char b2 = (c2 & 0xff);
    r1 = (r1 * a1 + r2 * (255 - a1)) >> 8;
    g1 = (g1 * a1 + g2 * (255 - a1)) >> 8;
    b1 = (b1 * a1 + b2 * (255 - a1)) >> 8;
    a1 = (a1 + a2 * (255 - a1)) >> 8;
    return (a1 << 24) | (r1 << 16) | (g1 << 8) | b1;
}

Compiled with gcc with the -Os option (optimize for size):
arm: 32 instructions, 128 bytes total
x86: 60 instructions, 140 bytes total
Compiled with gcc -O2 instead:
arm: 33 instructions, 132 bytes
x86: 55 instructions, 153 bytes

Again: many x86 instructions are two-operand, while many ARM instructions take three or four operands. Note that the ARM instruction set has grown complex since v8 ("ARMv4 supported 300 instructions [14]. In ARMv8 [22], that number has grown past a thousand. Overall, ARMv8 is complex and has many instruction formats."). And only 25% of CISC instructions are used 95% of the time — another reason RISC instruction counts can come out even lower than CISC's.

Finally, for the SPEC 2006 and EMBEDDED execution-cycle comparisons between x86 and ARM, see the reference A_Study_on_the_Impact_of_Instruction_Set_Architectures_on_Process.pdf.