author Joel Sherrill <joel.sherrill@OARcorp.com> 1999-06-14 16:51:13 +0000
committer Joel Sherrill <joel.sherrill@OARcorp.com> 1999-06-14 16:51:13 +0000
commit ba46ffa6169c0927c19d97816286b5ffaf2e9ab2
tree 2d71e9fa43bed5fe628a202df8710772b7ddb037 /c/src/lib/libbsp/powerpc/mcp750/bootloader/exception.S
parent Regenerated.
download rtems-ba46ffa6169c0927c19d97816286b5ffaf2e9ab2.tar.bz2
This is a large patch from Eric Valette <valette@crf.canon.fr> that was
described in the message following this paragraph. This patch also includes
a mcp750 BSP.

From valette@crf.canon.fr Mon Jun 14 10:03:08 1999
Date: Tue, 18 May 1999 01:30:14 +0200 (CEST)
From: VALETTE Eric <valette@crf.canon.fr>
To: joel@oarcorp.com
Cc: raguet@crf.canon.fr, rtems-snapshots@oarcorp.com, valette@crf.canon.fr
Subject: Questions/Suggestion regarding RTEMS PowerPC code (long)

Dear knowledgeable RTEMS powerpc users,

As some of you may know, I'm currently finalizing a port of RTEMS on a
MCP750 Motorola board. I have done most of it but have some questions to
ask before submitting the port.

In order to understand some of the changes I have made or would like to
make, maybe it is worth describing the MCP750 Motorola board.

The MCP750 is a COMPACT PCI powerpc board with:

    1) a MPC750 233 MHz processor,
    2) a raven bus bridge/PCI controller that implements an OPENPIC
       compliant interrupt controller,
    3) a VIA 82C586 PCI/ISA bridge that offers a PC compliant IO for
       keyboard, serial line, IDE, and the well known PC 8259 cascaded
       PIC interrupt architecture model,
    4) a DEC 21140 Ethernet controller,
    5) the PPCBUG Motorola firmware in flash,
    6) a DEC PCI bridge.

This architecture is common to most Motorola 60x/7xx boards except that:

    1) on VME boards, the DEC PCI bridge is replaced by a VME chipset,
    2) the VIA 82C586 PCI/ISA bridge is replaced by another bridge that
       is almost fully compatible with the VIA bridge...

So the port should be a rather close basis for many 60x/7xx Motorola
boards...

On this board, I already have ported Linux 2.2.3 and use it both as a
development and target board.

Now the questions/suggestions I have:

1) EXCEPTION CODE
-------------------

As far as I know exceptions on PPC are handled like interrupts. I dislike
this very much as:

    a) Except for the decrementer exception (and maybe some others on
       mpc8xx), exceptions are not recoverable and the handler just needs
       to print the full context and go to the firmware or debugger...
    b) The interrupt switch is only necessary for the decrementer and
       external interrupt (at least on 6xx, 7xx).
    c) The full context for exception is never saved and thus cannot be
       used by a debugger...

I do understand that the most important thing for interrupt low level code
is to save the minimal context enabling to call C code, for performance
reasons. On a non recoverable exception on the other hand, the most
important thing is to save the maximum information concerning proc status
in order to analyze the reason of the fault. At least we will need this in
order to implement the port of RGDB on PPC.

==> I wrote an API for connecting raw exceptions (and thus raw interrupts)
for mpc750. It should be valid for most powerpc processors... I hope to
find a way to make this coexist with the actual code layout. The code is
actually located in lib/libcpu/powerpc/mpc750 and is thus optional
(provided I write my own version of exec/score/cpu/powerpc/cpu.c ...)

See remark about files/directory layout organization in 4)

2) Current Implementation of ISR low level code
-----------------------------------------------

I do not understand why the MSR EE flag is cleared again in
exec/score/cpu/powerpc/irq_stubs.S

    #if (PPC_USE_SPRG)
            mfmsr   r5
            mfspr   r6, sprg2
    #else
            lwz     r6,msr_initial(r11)
            lis     r5,~PPC_MSR_DISABLE_MASK@ha
            ori     r5,r5,~PPC_MSR_DISABLE_MASK@l
            and     r6,r6,r5
            mfmsr   r5
    #endif

Reading the doc, when a decrementer interrupt or an external interrupt is
active, the MSR EE flag is already cleared. BTW if an exception/interrupt
could occur, it would trash SRR0 and SRR1. In fact the code may be useful
to set MSR[RI] that re-enables exception processing. BTW I will need to set
other values in MSR to handle interrupts:

    a) I want MSR[IR] and MSR[DR] to be set for performance reasons and
       also because I need DBAT support to have access to PCI memory
       space as the interrupt controller is in the PCI space.
Reading the code, I see others have the same kind of request:

    /* SCE 980217
     *
     * We need address translation ON when we call our ISR routine
            mtmsr   r5
     */

This is just another proof that even the lowest level IRQ code is
fundamentally board dependent and not simply processor dependent,
especially when the processor uses an external interrupt controller because
it has a single interrupt request line...

Note that if you look at the PPC high level interrupt handling code, as
the "set_vector" routine that really connects the interrupt is in the
BSP/startup/genpvec.c, the fact that IRQ handling is BSP specific is
DE-FACTO acknowledged.

I know I have already expressed this and understand that this would
require some heavy change in the code, but believe me you will reach a
point where you will not be able to find a compatible while optimum
implementation for low level interrupt handling code... In my case this is
already true...

So please consider removing low level IRQ handling from exec/score/cpu/*
and only let there exception handling code... Exceptions are usually only
processor dependent and do not depend on external hardware mechanisms to be
masked or acknowledged or re-enabled (there are probably exceptions
but ...)

I have already done this for the pc386 bsp but need to make it again. This
time I will even propose an API.

3) R2/R13 manipulation for EABI implementation
----------------------------------------------

I do not understand the handling of r2 and r13 in the EABI case. The
specification for r2 says pointer to sdata2, sbss2 section => constant.
However I do not see -ffixed-r2 passed to any compilation system in
make/custom/* (for info linux does this on PPC).

So either this is a default compiler option when choosing powerpc-rtems
and thus we do not need to do anything with this register as all the code
is compiled with this compiler and linked together OR this register may be
used by rtems code and then we do not need any special initialization or
handling.
The specification for r13 says pointer to the small data area. The r13
argumentation is the same except that as far as I know the usage of the
small data area requires specific compiler support so that access to
variables is compiled via loading the LSB in a register and then using r13
to get the full address... It is like a small memory model and it was
present in IBM C compilers.

=> I propose to suppress any specific code for r2 and r13 in the EABI case.

4) Code layout organization (yes again :-))
-------------------------------------------

I think there are a number of design flaws in the way the code for ppc is
organized and I will try to point them out. I have been beaten by this
again on this new port, and was beaten last year while modifying code for
pc386.

a) exec/score/cpu/* vs lib/libcpu/cpu/*.

I think that too many things are put in exec/score/cpu that have nothing
to do with RTEMS internals but are rather related to CPU features. This
includes at least:

    a) register access routines (e.g. GET_MSR_Value),
    b) interrupt masking/unmasking routines,
    c) cache_mngt_routine,
    d) mmu_mngt_routine,
    e) routines to connect the raw_exception, raw_interrupt handler.

b) lib/libcpu/cpu/powerpc/*

With a processor family as exuberant as the powerpc family, and their well
known subtle differences (604 vs 750) or unfortunately major ones (8xx vs
60x), the directory structure is fine (except maybe the names that are not
homogeneous):

    powerpc
        ppc421 mpc821 ...

I only needed to add mpc750. But the fact that libcpu.a was not produced
was a pain and the fact that this organization may duplicate code is also
problematic.

So, except if the support of automake provides a better solution, I would
like to propose something like this:

    powerpc
        mpc421 mpc821 ... mpc750
        shared
        wrapup

with the following rules:

    a) "shared" would act as a source container for sources that may be
       shared among processors.
       Needed files would be compiled inside the processor specific
       directory using the vpath Makefile mechanism. "shared" may also
       contain compilation code for routines that are really shared and
       not worth inlining... (did not find many things so far as
       register access routines ARE WORTH INLINING)... In the case
       something is compiled there, it should create libcpushared.a

    b) layout under a processor specific directory is free provided that
        1) the result of the compilation process exports:
               libcpu/powerpc/"PROC"/*.h in $(PROJECT_INCLUDE)/libcpu
        2) each processor specific directory creates a library called
           libcpuspecific.a

       Note that this organization enables to have a file that is nearly
       the same as in shared but that must differ because of processor
       differences...

    c) "wrapup" should create libcpu.a using libcpushared.a and
       libcpuspecific.a and export it to $(PROJECT_INCLUDE)/libcpu

The only thing I have no ideal solution for is the way to put shared
definitions in "shared" and only processor specific definitions in "proc".
To give a concrete example, most MSR bit definitions are shared among PPC
processors and only some differ. If we create a single msr.h in shared it
will have ifdefs. If in msr.h we include libcpu/msr_c.h we will need to
have it in each powerpc specific directory (even empty). Opinions are
welcomed ...

Note that a similar mechanism exists in libbsp/i386 that also contains a
shared directory that is used by several bsps like pc386 and i386ex and a
similar wrapup mechanism...

NB: I have done this for mpc750 and other processors could just use
similar Makefiles...

c) The exec/score/cpu/powerpc directory layout.

I think the directory layout should be the same as the libcpu/powerpc one.
As it is not, there are a lot of ifdefs inside the code... And of course
low level interrupt handling code should be removed...
Besides that I do not understand why

    1) things are compiled in the wrap directory,
    2) some includes are moved to rtems/score.

I think the "preinstall" mechanism enables to put everything in the
current directory (or better in a per processor directory).

5) Interrupt handling API
-------------------------

Again :-). But I think that using all the features the PIC offers is a
MUST for an RT system. I already explained in the prologue of this (long
and probably boring) mail that the MCP750 board offers an OPENPIC
compliant architecture and that the VIA 82586 PCI/ISA bridge offers a PC
compatible IO and PIC mapping.

Here is a logical view of the RAVEN/VIA 82586 interrupt mapping:

    ---------   0 ------
    | OPEN  | <-----|8259|
    | PIC   |       |    | 2 ------
    |(RAVEN)|       |    | <-----|8259|
    |       |       |    |       |    | 11
    |       |       |    |       |    | <----
    |       |       |    |       |    |
    |       |       |    |       |    |
    ---------       ------       |    |
        ^                        ------
        |       VIA PCI/ISA bridge
        |  x
        -------- PCI interrupts

OPENPIC offers interrupt priorities among PCI interrupts and interrupt
selective masking. The 8259 offers the same kind of feature. With actual
powerpc interrupt code:

    1) there is no way to specify priorities among interrupt handlers.
       This is REALLY a bad thing. For me it is as important as having
       priorities for threads...
    2) for my implementation, each ISR should contain the code that
       acknowledges the RAVEN and 8259 cascade, modifies the interrupt
       mask on both chips, and re-enables interrupts at processor
       level, ..., restores them on interrupt return, .... This code is
       actually similar to code located in some genpvec.c powerpc files,
    3) I must update _ISR_Nesting_level because irq.inl uses it...
    4) the libchip code connects the ISR via set_vector but the libchip
       handler code does not contain any code to manipulate external
       interrupt controller hardware in order to acknowledge the
       interrupt or re-enable them (except for the target hardware of
       course). So this code is broken unless set_vector adds an
       additional prologue/epilogue before calling/returning from it in
       order to acknowledge/mask the raven and the 8259 PICS...

=> Anyway already EACH BSP MUST REWRITE PART OF INTERRUPT HANDLING CODE TO
CORRECTLY IMPLEMENT SET_VECTOR.

I would rather offer an API similar to the one provided in
libbsp/i386/shared/irq/irq.h so that:

    1) Once the driver supplied method is called the only thing the ISR
       has to do is to worry about the external hardware that triggered
       the interrupt. Everything on openpic/VIA/processor would have been
       done by the low levels (same things as set-vector)
    2) The caller will need to supply the on/off/isOn routines that are
       fundamental to correctly implement debuggers/performance
       monitoring in a portable way
    3) A globally configurable interrupt priorities mechanism...

I have nothing against providing a compatible set_vector just to make
libchip happy but as I have already explained in other mails (months ago),
I really think that the ISR connection should be handled by the BSP and
that no code containing irq connection should exist in the rtems generic
layers... Thus I really dislike libchip on this aspect because in the long
term it will force to adopt the least rich API for interrupt handling that
exists (set_vector).

Additional note: I think the _ISR_Is_in_progress() inline routine should:

    1) be put in a processor specific section,
    2) not rely on a global variable,

as:

    a) on symmetric MP, there is one interrupt level per CPU,
    b) on processors that have an ISP (e.g. 68040), this variable is
       useless (MSR bit testing could be used),
    c) on PPC, instead of using the address of the variable via
       __CPU_IRQ_info.Nest_level a dedicated SPR could be used.
NOTE: most of this is also true for _Thread_Dispatch_disable_level

END NOTE
--------

Please do not take what I said in the mail as a criticism for anyone who
submitted ppc code. Any code present helped me a lot understanding PPC
behavior. I just wanted by this mail to:

    1) try to better understand the actual code,
    2) propose concrete ways of enhancing current code by providing an
       alternative implementation for MCP750. I will make my best effort
       to try to break nothing but this is actually hard due to the file
       layout organisation,
    3) make understandable some changes I will probably make if joel
       lets me do them :-)

Any comments/objections are welcomed as usual.

--
   __
  /  `                   Eric Valette
 /--   __  o _.          Canon CRF
(___, / (_(_(__          Rue de la touche lambert
                         35517 Cesson-Sevigne Cedex
                         FRANCE
Tel: +33 (0)2 99 87 68 91    Fax: +33 (0)2 99 84 11 30
E-mail: valette@crf.canon.fr
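
Section 5 of the message asks for a connection API in which the caller supplies on/off/isOn routines next to the handler, plus some notion of priority. The following is only a minimal sketch of that shape in C, not the API from libbsp/i386/shared/irq/irq.h or from this patch: every name (`irq_connect_data`, `irq_connect`, `irq_disconnect`, `irq_dispatch`, `MAX_IRQ`, the `demo_*` stand-ins) is hypothetical, and the priority is merely recorded because no interrupt controller is modelled.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IRQ 16                      /* hypothetical number of lines */

/* Hypothetical connection record: handler plus on/off/isOn callbacks. */
typedef struct {
    int   line;                         /* interrupt line number */
    int   priority;                     /* recorded only; no hardware here */
    void (*handler)(void);              /* driver ISR */
    void (*on)(int line);               /* unmask at the device/controller */
    void (*off)(int line);              /* mask at the device/controller */
    int  (*is_on)(int line);            /* query the mask state */
} irq_connect_data;

static irq_connect_data irq_table[MAX_IRQ];

/* Connect a handler and enable its line; returns 1 on success, else 0. */
int irq_connect(const irq_connect_data *d)
{
    if (d == NULL || d->handler == NULL || d->line < 0 || d->line >= MAX_IRQ)
        return 0;
    if (irq_table[d->line].handler != NULL)
        return 0;                       /* line already taken */
    irq_table[d->line] = *d;
    if (d->on != NULL)
        d->on(d->line);
    return 1;
}

/* Disable the line and forget its handler; returns 1 if one was connected. */
int irq_disconnect(int line)
{
    if (line < 0 || line >= MAX_IRQ || irq_table[line].handler == NULL)
        return 0;
    if (irq_table[line].off != NULL)
        irq_table[line].off(line);
    irq_table[line].handler = NULL;
    return 1;
}

/* What low-level code would call once the controllers are acknowledged. */
void irq_dispatch(int line)
{
    assert(line >= 0 && line < MAX_IRQ);
    if (irq_table[line].handler != NULL)
        irq_table[line].handler();
}

/* Stand-in "device" so the sketch can be exercised without hardware. */
static int demo_enabled[MAX_IRQ];
static int demo_hits;
static void demo_on(int line)    { demo_enabled[line] = 1; }
static void demo_off(int line)   { demo_enabled[line] = 0; }
static int  demo_is_on(int line) { return demo_enabled[line]; }
static void demo_isr(void)       { demo_hits++; }
```

With this shape the ISR itself only deals with its own device, and a debugger or monitor could call the stored `is_on` callback to query a line without knowing which controller sits behind it.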
Diffstat (limited to 'c/src/lib/libbsp/powerpc/mcp750/bootloader/exception.S')
-rw-r--r-- c/src/lib/libbsp/powerpc/mcp750/bootloader/exception.S 466
1 files changed, 466 insertions, 0 deletions
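
The three TLB miss handlers in the file added by this patch share one search pattern: scan the eight PTEs of the primary PTEG (HASH1) for the compare value (iCmp/dCmp), and if nothing matches, set the 0x0040 bit in the compare value and scan the secondary PTEG (HASH2). As a rough C model of that control flow only (it ignores the cache and dual-issue ordering the assembly is built around; `pte_t`, `search_pteg` and `find_pte` are names invented for this sketch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_PTEG 8u        /* 8 PTEs of 8 bytes each per group */
#define PTE_H_BIT     0x0040u   /* marks a secondary-hash entry in word 0 */

/* Hypothetical view of one page table entry as two 32-bit words. */
typedef struct {
    uint32_t word0;             /* compared against iCmp/dCmp */
    uint32_t word1;             /* copied to RPA on a hit */
} pte_t;

/* Linear scan of one PTEG; returns the matching entry or NULL. */
static const pte_t *search_pteg(const pte_t *pteg, uint32_t cmp)
{
    for (unsigned i = 0; i < PTES_PER_PTEG; i++) {
        if (pteg[i].word0 == cmp)
            return &pteg[i];
    }
    return NULL;
}

/* Primary search first, then the secondary group with the H bit set.
 * NULL corresponds to the "entry not found" path that raises ISI/DSI. */
const pte_t *find_pte(const pte_t *hash1, const pte_t *hash2, uint32_t cmp)
{
    assert(hash1 != NULL && hash2 != NULL);
    const pte_t *hit = search_pteg(hash1, cmp);
    if (hit != NULL)
        return hit;
    return search_pteg(hash2, cmp | PTE_H_BIT);
}
```

In the assembly the two scans are one loop entered twice, with the 0x0040 bit of r3 doubling as the "second pass already done" flag.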
diff --git a/c/src/lib/libbsp/powerpc/mcp750/bootloader/exception.S b/c/src/lib/libbsp/powerpc/mcp750/bootloader/exception.S
new file mode 100644
index 0000000000..cf539cf8d2
--- /dev/null
+++ b/c/src/lib/libbsp/powerpc/mcp750/bootloader/exception.S
@@ -0,0 +1,466 @@
+/*
+ * arch/ppc/loader/exception.S -- Exception handlers for early boot.
+ *
+ * Copyright (C) 1998 Gabriel Paubert, paubert@iram.es
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of this archive
+ * for more details.
+ */
+
+/* This is an improved version of the TLB interrupt handling code from
+ * the 603e users manual (603eUM.pdf) downloaded from the WWW. All the
+ * visible bugs have been removed. Note that many have survived in the errata
+ * to the 603 user manual (603UMer.pdf).
+ *
+ * This code also pays particular attention to optimization, takes into
+ * account the differences between 603 and 603e, single/multiple processor
+ * systems and tries to order instructions for dual dispatch in many places.
+ *
+ * The optimization has been performed along two lines:
+ * 1) to minimize the number of instruction cache lines needed for the most
+ * common execution paths (the ones that do not result in an exception).
+ * 2) then to order the code to maximize the number of dual issue and
+ * completion opportunities without increasing the number of cache lines
+ * used in the same cases.
+ *
+ * The last goal of this code is to fit inside the address range
+ * assigned to the interrupt vectors: 192 instructions with fixed
+ * entry points every 64 instructions.
+ *
+ * Some typos have also been corrected and the Power l (lowercase L)
+ * instructions replaced by lwz without comment.
+ *
+ * I have attempted to describe the reasons of the order and of the choice
+ * of the instructions but the comments may be hard to understand without
+ * the processor manual.
+ *
+ * Note that the fact that the TLB are reloaded by software in theory
+ * allows tremendous flexibility, for example we could avoid setting the
+ * reference bit of the PTE which could actually not be accessed because
+ * of protection violation by changing a few lines of code. However,
+ * this would significantly slow down most TLB reload operations, and
+ * this is the reason for which we try never to make checks which would be
+ * redundant with hardware and usually indicate a bug in a program.
+ *
+ * There are some inconsistencies in the documentation concerning the
+ * settings of SRR1 bit 15. All recent documentation now says that it is set
+ * for stores and cleared for loads. Anyway this handler never uses this bit.
+ *
+ * A final remark, the rfi instruction seems to implicitly clear the
+ * MSR<14> (tgpr)bit. The documentation claims that this bit is restored
+ * from SRR1 by rfi, but the corresponding bit in SRR1 is the LRU way bit.
+ * Anyway, the only exception which can occur while TGPR is set is a machine
+ * check which would indicate an unrecoverable problem. Recent documentation
+ * now says in some places that rfi clears MSR<14>.
+ *
+ * TLB software load for 602/603/603e/603ev:
+ * Specific Instructions:
+ * tlbld - write the dtlb with the pte in rpa reg
+ * tlbli - write the itlb with the pte in rpa reg
+ * Specific SPRs:
+ * dmiss - address of dstream miss
+ * imiss - address of istream miss
+ * hash1 - returns primary hash PTEG address
+ * hash2 - returns secondary hash PTEG address
+ * iCmp - returns the primary istream compare value
+ * dCmp - returns the primary dstream compare value
+ * rpa - the second word of pte used by tlblx
+ * Other specific resources:
+ * cr0 saved in 4 high order bits of SRR1,
+ * SRR1 bit 14 [WAY] selects TLB set to load from LRU algorithm
+ * gprs r0..r3 shadowed by the setting of MSR bit 14 [TGPR]
+ * other bits in SRR1 (unused by this handler but see earlier comments)
+ *
+ * There are three basic flows corresponding to three vectors:
+ * 0x1000: Instruction TLB miss,
+ * 0x1100: Data TLB miss on load,
+ * 0x1200: Data TLB miss on store or not dirty page
+ */
+
+/* define the following if code does not have to run on basic 603 */
+/* #define USE_KEY_BIT */
+
+/* define the following for safe multiprocessing */
+/* #define MULTIPROCESSING */
+
+/* define the following for mixed endian */
+/* #define CHECK_MIXED_ENDIAN */
+
+/* define the following if entries always have the reference bit set */
+#define ASSUME_REF_SET
+
+/* Some OS kernels may want to keep a single copy of the dirty bit in a per
+ * page table. In this case writable pages are always write-protected as long
+ * as they are clean, and the dirty bit set actually means that the page
+ * is writable.
+ */
+#define DIRTY_MEANS_WRITABLE
+
+#include <libcpu/cpu.h>
+#include "asm.h"
+#include "bootldr.h"
+
+/*
+ * Instruction TLB miss flow
+ * Entry at 0x1000 with the following:
+ * srr0 -> address of instruction that missed
+ * srr1 -> 0:3=cr0, 13=1 (instruction), 14=lru way, 16:31=saved MSR
+ * msr<tgpr> -> 1
+ * iMiss -> ea that missed
+ * iCmp -> the compare value for the va that missed
+ * hash1 -> pointer to first hash pteg
+ * hash2 -> pointer to second hash pteg
+ *
+ * Register usage:
+ * r0 is limit address during search / scratch after
+ * r1 is pte data / error code for ISI exception when search fails
+ * r2 is pointer to pte
+ * r3 is compare value during search / scratch after
+ */
+/* Binutils or assembler bug ? Declaring the section executable and writable
+ * generates an error message on the @fixup entries.
+ */
+ .section .exception,"aw"
+# .org 0x1000 # instruction TLB miss entry point
+ .globl tlb_handlers
+tlb_handlers:
+ .type tlb_handlers,@function
+#define ISIVec tlb_handlers-0x1000+0x400
+#define DSIVec tlb_handlers-0x1000+0x300
+ mfspr r2,HASH1
+ lwz r1,0(r2) # Start memory access as soon as possible
+ mfspr r3,ICMP # to load the cache.
+0: la r0,48(r2) # Use explicit loop to avoid using ctr
+1: cmpw r1,r3 # In theory the loop is somewhat slower
+ beq- 2f # than documentation example
+ cmpw r0,r2 # but we gain from starting cache load
+ lwzu r1,8(r2) # earlier and using slots between load
+ bne+ 1b # and comparison for other purposes.
+ cmpw r1,r3
+ bne- 4f # Secondary hash check
+2: lwz r1,4(r2) # Found: load second word of PTE
+ mfspr r0,IMISS # get miss address during load delay
+#ifdef ASSUME_REF_SET
+ andi. r3,r1,8 # check for guarded memory
+ bne- 5f
+ mtspr RPA,r1
+ mfsrr1 r3
+ tlbli r0
+#else
+/* This is basically the original code from the manual. */
+# andi. r3,r1,8 # check for guarded memory
+# bne- 5f
+# andi. r3,r1,0x100 # check R bit ahead to help folding
+/* However there is a better solution: these last three instructions can be
+replaced by the following which should cause fewer pipeline stalls because
+both tests are combined and there is a single CR rename buffer */
+ extlwi r3,r1,6,23 # Keep only RCWIMG in 6 most significant bits.
+ rlwinm. r3,r3,5,0,27 # Keep only G (in sign) and R and test.
+ blt- 5f # Negative means guarded, zero R not set.
+ mfsrr1 r3 # get saved cr0 bits now to dual issue
+ ori r1,r1,0x100
+ mtspr RPA,r1
+ tlbli r0
+/* Do not update PTE if R bit already set, this will save one cache line
+writeback at a later time, and avoid even more bus traffic in
+multiprocessing systems, when several processors access the same PTEGs.
+We also hope that the reference bit will be already set. */
+ bne+ 3f
+#ifdef MULTIPROCESSING
+ srwi r1,r1,8 # get byte 7 of pte
+ stb r1,+6(r2) # update page table
+#else
+ sth r1,+6(r2) # update page table
+#endif
+#endif
+3: mtcrf 0x80,r3 # restore CR0
+ rfi # return to executing program
+
+/* The preceding code is 20 to 25 instructions long, which occupies
+3 or 4 cache lines. */
+4: andi. r0,r3,0x0040 # see if we have done second hash
+ lis r1,0x4000 # set up error code in case next branch taken
+ bne- 6f # speculatively issue the following
+ mfspr r2,HASH2 # get the second pointer
+ ori r3,r3,0x0040 # change the compare value
+ lwz r1,0(r2) # load first entry
+ b 0b # and go back to main loop
+/* We are now at 27 to 32 instructions, using 3 or 4 cache lines for all
+cases in which the TLB is successfully loaded. */
+
+/* Guarded memory protection violation: synthesize an ISI exception. */
+5: lis r1,0x1000 # set srr1<3>=1 to flag guard violation
+/* Entry Not Found branches here with r1 correctly set. */
+6: mfsrr1 r3
+ mfmsr r0
+ insrwi r1,r3,16,16 # build srr1 for ISI exception
+ mtsrr1 r1 # set srr1
+/* It seems few people have realized rlwinm can be used to clear a bit or
+a field of contiguous bits in a register by setting mask_begin>mask_end. */
+ rlwinm r0,r0,0,15,13 # clear the msr<tgpr> bit
+ mtcrf 0x80, r3 # restore CR0
+ mtmsr r0 # flip back to the native gprs
+ isync # Required from 602 doc!
+ b ISIVec # go to instruction access exception
+/* Up to now there are 37 to 42 instructions so at least 20 could be
+inserted for complex cases or for statistics recording. */
+
+
+/*
+ Data TLB miss on load flow
+ Entry at 0x1100 with the following:
+ srr0 -> address of instruction that caused the miss
+ srr1 -> 0:3=cr0, 13=0 (data), 14=lru way, 15=0, 16:31=saved MSR
+ msr<tgpr> -> 1
+ dMiss -> ea that missed
+ dCmp -> the compare value for the va that missed
+ hash1 -> pointer to first hash pteg
+ hash2 -> pointer to second hash pteg
+
+ Register usage:
+ r0 is limit address during search / scratch after
+ r1 is pte data / error code for DSI exception when search fails
+ r2 is pointer to pte
+ r3 is compare value during search / scratch after
+*/
+ .org tlb_handlers+0x100
+ mfspr r2,HASH1
+ lwz r1,0(r2) # Start memory access as soon as possible
+ mfspr r3,DCMP # to load the cache.
+0: la r0,48(r2) # Use explicit loop to avoid using ctr
+1: cmpw r1,r3 # In theory the loop is somewhat slower
+ beq- 2f # than documentation example
+ cmpw r0,r2 # but we gain from starting cache load
+ lwzu r1,8(r2) # earlier and using slots between load
+ bne+ 1b # and comparison for other purposes.
+ cmpw r1,r3
+ bne- 4f # Secondary hash check
+2: lwz r1,4(r2) # Found: load second word of PTE
+ mfspr r0,DMISS # get miss address during load delay
+#ifdef ASSUME_REF_SET
+ mtspr RPA,r1
+ mfsrr1 r3
+ tlbld r0
+#else
+ andi. r3,r1,0x100 # check R bit ahead to help folding
+ mfsrr1 r3 # get saved cr0 bits now to dual issue
+ ori r1,r1,0x100
+ mtspr RPA,r1
+ tlbld r0
+/* Do not update PTE if R bit already set, this will save one cache line
+writeback at a later time, and avoid even more bus traffic in
+multiprocessing systems, when several processors access the same PTEGs.
+We also hope that the reference bit will be already set. */
+ bne+ 3f
+#ifdef MULTIPROCESSING
+ srwi r1,r1,8 # get byte 7 of pte
+ stb r1,+6(r2) # update page table
+#else
+ sth r1,+6(r2) # update page table
+#endif
+#endif
+3: mtcrf 0x80,r3 # restore CR0
+ rfi # return to executing program
+
+/* The preceding code is 18 to 23 instructions long, which occupies
+3 cache lines. */
+4: andi. r0,r3,0x0040 # see if we have done second hash
+ lis r1,0x4000 # set up error code in case next branch taken
+ bne- 9f # speculatively issue the following
+ mfspr r2,HASH2 # get the second pointer
+ ori r3,r3,0x0040 # change the compare value
+ lwz r1,0(r2) # load first entry asap
+ b 0b # and go back to main loop
+/* We are now at 25 to 30 instructions, using 3 or 4 cache lines for all
+cases in which the TLB is successfully loaded. */
+
+
+/*
+ Data TLB miss on store or not dirty page flow
+ Entry at 0x1200 with the following:
+ srr0 -> address of instruction that caused the miss
+ srr1 -> 0:3=cr0, 13=0 (data), 14=lru way, 15=1, 16:31=saved MSR
+ msr<tgpr> -> 1
+ dMiss -> ea that missed
+ dCmp -> the compare value for the va that missed
+ hash1 -> pointer to first hash pteg
+ hash2 -> pointer to second hash pteg
+
+ Register usage:
+ r0 is limit address during search / scratch after
+ r1 is pte data / error code for DSI exception when search fails
+ r2 is pointer to pte
+ r3 is compare value during search / scratch after
+*/
+ .org tlb_handlers+0x200
+ mfspr r2,HASH1
+ lwz r1,0(r2) # Start memory access as soon as possible
+ mfspr r3,DCMP # to load the cache.
+0: la r0,48(r2) # Use explicit loop to avoid using ctr
+1: cmpw r1,r3 # In theory the loop is somewhat slower
+ beq- 2f # than documentation example
+ cmpw r0,r2 # but we gain from starting cache load
+ lwzu r1,8(r2) # earlier and using slots between load
+ bne+ 1b # and comparison for other purposes.
+ cmpw r1,r3
+ bne- 4f # Secondary hash check
+2: lwz r1,4(r2) # Found: load second word of PTE
+ mfspr r0,DMISS # get miss address during load delay
+/* We could simply set the C bit and then rely on hardware to flag protection
+violations. This raises the problem that a page which actually has not been
+modified may be marked as dirty and violates the OEA model for guaranteed
+bit settings (table 5-8 of 603eUM.pdf). This can have harmful consequences
+on operating system memory management routines, and play havoc with copy on
+write schemes. So the protection check is ABSOLUTELY necessary. */
+ andi. r3,r1,0x80 # check C bit
+ beq- 5f # if (C==0) go to check protection
+3: mfsrr1 r3 # get the saved cr0 bits
+ mtspr RPA,r1 # set the pte
+ tlbld r0 # load the dtlb
+ mtcrf 0x80,r3 # restore CR0
+ rfi # return to executing program
+/* The preceding code is 20 instructions long, which occupies
+3 cache lines. */
+4: andi. r0,r3,0x0040 # see if we have done second hash
+ lis r1,0x4200 # set up error code in case next branch taken
+ bne- 9f # speculatively issue the following
+ mfspr r2,HASH2 # get the second pointer
+ ori r3,r3,0x0040 # change the compare value
+ lwz r1,0(r2) # load first entry asap
+ b 0b # and go back to main loop
+/* We are now at 27 instructions, using 3 or 4 cache lines for all
+cases in which the TLB C bit is already set. */
+
+#ifdef DIRTY_MEANS_WRITABLE
+5: lis r1,0x0A00 # protection violation on store
+#else
+/*
+ Entry found and C==0: check protection before setting C:
+ Register usage:
+ r0 is dMiss register
+ r1 is PTE entry (to be copied to RPA if success)
+ r2 is pointer to pte
+ r3 is trashed
+
+ For the 603e, the key bit in SRR1 helps to decide whether there is a
+ protection violation. However the way the check is done in the manual is
+ not very efficient. The code shown here works as well for 603 and 603e and
+ is much more efficient for the 603 and comparable to the manual example
+ for 603e. This code however has quite a bad structure due to the fact it
+ has been reordered to speed up the most common cases.
+*/
+/* The first of the following two instructions could be replaced by
+andi. r3,r1,3 but it would compete with cmplwi for cr0 resource. */
+5: clrlwi r3,r1,30 # Extract two low order bits
+ cmplwi r3,2 # Test for PP=10
+ bne- 7f # assume fallthrough is more frequent
+6: ori r1,r1,0x180 # set referenced and changed bit
+ sth r1,6(r2) # update page table
+ b 3b # and finish loading TLB
+/* We are now at 33 instructions, using 5 cache lines. */
+7: bgt- 8f # if PP=11 then DSI protection exception
+/* This code only works if key bit is present (602/603e/603ev) */
+#ifdef USE_KEY_BIT
+ mfsrr1 r3 # get the KEY bit and test it
+ andis. r3,r3,0x0008
+ beq 6b # default prediction taken, truly better ?
+#else
+/* This code is for all 602 and 603 family models: */
+ mfsrr1 r3 # Here the trick is to use the MSR PR bit as a
+ mfsrin r0,r0 # shift count for an rlwnm. instruction which
+ extrwi r3,r3,1,17 # extracts and tests the correct key bit from
+ rlwnm. r3,r0,r3,1,1 # the segment register. RISC they said...
+ mfspr r0,DMISS # Restore fault address to r0
+ beq 6b # if 0 load tlb else protection fault
+#endif
+/* We are now at 40 instructions, (37 if using key bit), using 5 cache
+lines in all cases in which the C bit is successfully set */
+8: lis r1,0x0A00 # protection violation on store
+#endif /* DIRTY_MEANS_WRITABLE */
+/* PTE entry not found branch here with DSISR code in r1 */
+9: mfsrr1 r3
+ mtdsisr r1
+ clrlwi r2,r3,16 # set up srr1 for DSI exception
+ mfmsr r0
+/* I have some doubts about the usefulness of the xori instruction in
+mixed or pure little-endian environment. The address is in the same
+doubleword, hence in the same protection domain and performing an exclusive
+or with 7 is only valid for byte accesses. */
+#ifdef CHECK_MIXED_ENDIAN
+ andi. r1,r2,1 # test LE bit ahead to help folding
+#endif
+ mtsrr1 r2
+ rlwinm r0,r0,0,15,13 # clear the msr<tgpr> bit
+ mfspr r1,DMISS # get miss address
+#ifdef CHECK_MIXED_ENDIAN
+ beq 1f # if little endian then:
+ xori r1,r1,0x07 # de-mung the data address
+1:
+#endif
+ mtdar r1 # put in dar
+ mtcrf 0x80,r3 # restore CR0
+ mtmsr r0 # flip back to the native gprs
+ isync # required from 602 manual
+ b DSIVec # branch to DSI exception
+/* We are now between 50 and 56 instructions. Close to the limit
+but should be sufficient in case bugs are found. */
+/* Altogether the three handlers occupy 128 instructions in the worst
+case, 64 instructions could still be added (non contiguously). */
+ .org tlb_handlers+0x300
+ .globl _handler_glue
+_handler_glue:
+/* Entry code for exceptions: DSI (0x300), ISI(0x400), alignment(0x600) and
+ * traps(0x700). In theory it is not necessary to save and restore r13 and all
+ * higher numbered registers, but it is done because it allowed calling the
+ * firmware (PPCBug) for debugging in the very first stages when writing the
+ * bootloader.
+ */
+ stwu r1,-160(r1)
+ stw r0,save_r(0)
+ mflr r0
+ stmw r2,save_r(2)
+ bl 0f
+0: mfctr r4
+ stw r0,save_lr
+ mflr r9 /* Interrupt vector + few instructions */
+ la r10,160(r1)
+ stw r4,save_ctr
+ mfcr r5
+ lwz r8,2f-0b(r9)
+ mfxer r6
+ stw r5,save_cr
+ mtctr r8
+ stw r6,save_xer
+ mfsrr0 r7
+ stw r10,save_r(1)
+ mfsrr1 r8
+ stw r7,save_nip
+ la r4,8(r1)
+ lwz r13,1f-0b(r9)
+ rlwinm r3,r9,24,0x3f /* Interrupt vector >> 8 */
+ stw r8,save_msr
+ bctrl
+
+ lwz r7,save_msr
+ lwz r6,save_nip
+ mtsrr1 r7
+ lwz r5,save_xer
+ mtsrr0 r6
+ lwz r4,save_ctr
+ mtxer r5
+ lwz r3,save_lr
+ mtctr r4
+ lwz r0,save_cr
+ mtlr r3
+ lmw r2,save_r(2)
+ mtcr r0
+ lwz r0,save_r(0)
+ la r1,160(r1)
+ rfi
+1: .long (__bd)@fixup
+2: .long (_handler)@fixup
+ .section .fixup,"aw"
+ .align 2
+ .long 1b, 2b
+ .previous
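
The handlers above clear MSR<tgpr> with `rlwinm r0,r0,0,15,13`, relying on the fact that a mask whose begin bit is greater than its end bit wraps around and covers everything except the bits in between. A small C model of the instruction makes the arithmetic easy to check; it uses the PowerPC convention that bit 0 is the most significant bit, and the helper names (`ppc_mask`, `rotl32`, `rlwinm`) are chosen for this sketch only:

```c
#include <assert.h>
#include <stdint.h>

/* Mask with ones in bits mb..me, bit 0 being the MSB. When mb > me the
 * mask wraps: ones in mb..31 and in 0..me. */
static uint32_t ppc_mask(unsigned mb, unsigned me)
{
    assert(mb < 32 && me < 32);
    uint32_t from_mb = 0xFFFFFFFFu >> mb;          /* bits mb..31 */
    uint32_t upto_me = 0xFFFFFFFFu << (31u - me);  /* bits 0..me  */
    return (mb <= me) ? (from_mb & upto_me) : (from_mb | upto_me);
}

/* Rotate left by n bits (n taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Model of rlwinm rA,rS,SH,MB,ME: rotate left, then AND with the mask. */
uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(rs, sh) & ppc_mask(mb, me);
}
```

With mb=15 and me=13 the mask is 0xFFFDFFFF, so only bit 14 (value 0x00020000, the TGPR bit) is cleared and every other MSR bit passes through unchanged.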