i.MX 7 Cortex-M4 memory locations and performance

The NXP i.MX 7 SoC's heterogeneous architecture provides a secondary CPU platform with a Cortex-M4 core. This core can be used to run firmware for custom tasks. The SoC has several options where the firmware can be located: There is a small portion of Tightly Coupled Memory (TCM) close to the Cortex-M4 core. A slightly larger amount of On-Chip SRAM (OCRAM) is available inside the SoC too. The Cortex-M4 core is also able to run from external DDR memory (through the MMDC) and from QSPI flash. Furthermore, the Cortex-M4 uses a Modified Harvard Architecture, which has two independent buses and caches for code (Code Bus) and data (System Bus). The memory addressing is still unified, but accesses are split between the two buses using the address as discriminator: addresses in the range 0x00000000-0x1fffffff are fetched through the Code Bus, while the range 0x20000000-0xdfffffff is accessed through the System Bus.

i.MX 7 Simplified Architecture Overview

I was wondering how the different memory locations and buses affect performance. As a starting point I used the Hello World example which comes with the NXP i.MX 7 FreeRTOS BSP (more precisely, the Toradex derivation of the BSP). I added a micro benchmark found and forked on GitHub and adjusted the memory sections in the linker files (refer to the i.MX 7 Reference Manual for the list of memory addresses) to load the firmware into the different memory areas. I used the rather ancient Linaro GCC 4.9 toolchain (2015q3). Since this is a synthetic and very small benchmark, the numbers are likely not directly applicable to real-world applications!

The standard mode of operation should be fetching data through the System Bus and code (.text), unsurprisingly, through the Code Bus. This provides the best performance (assuming equally good caching on the two buses). Since the System Bus is somewhat more versatile, I also tested how fetching code through the System Bus affects performance. The results are execution times in milliseconds. The micro benchmark was highly reproducible; values never deviated by more than 1 ms between two runs.

Caches Disabled

.data      System bus  System bus  Code bus
.text      Code bus    System bus  Code bus
DDR        1329 ms     1402 ms     1357 ms
OCRAM      401 ms      486 ms      -
OCRAM_EPD  564 ms      604 ms      -
OCRAM_S    840 ms      910 ms      861 ms
TCM        45 ms       72 ms       -

System Bus Cache Enabled

.data                  System bus  System bus
.text                  Code bus    System bus
DDR                    1290 ms     72 ms
DDR (non-cached area)  1438 ms     1486 ms
OCRAM                  387 ms      72 ms
OCRAM_S                812 ms      93 ms

System and Code Bus Cache Enabled

.data                  System bus  System bus
.text                  Code bus    System bus
DDR                    1406 ms     72 ms
DDR (non-cached area)  1437 ms     1486 ms
OCRAM                  540 ms      72 ms
OCRAM_S                66 ms       93 ms

Observations

  • Unsurprisingly, TCM is the fastest memory. There also seems to be a difference in access times between the OCRAM areas.
  • The DDR memory area which can be cached is limited to the first two megabytes according to the i.MX 7 Reference Manual (see the note in chapter 4.2.9.3.5, Cache Function: 0x80000000-0x801fffff). However, the tests suggest that the first four megabytes are cacheable (0x80000000-0x803fffff). Everything from 0x80400000 onwards is definitely uncached (row "DDR (non-cached area)" in the results).
  • Fetching code through a cached System Bus is much faster than fetching it through an uncached Code Bus.
  • It seems that the Code Bus cache does not work for DDR and OCRAM, which is somewhat unfortunate. The above-mentioned chapter even suggests that the cache cannot be used for any Code Bus memory region, which is puzzling given the OCRAM_S result.
  • The tests were run using the default MPU cache settings. Changing the MPU cache bits had no impact on most measurements.

Note that when running from TCM, caches do not affect performance, since the Cortex-M4 already has access to TCM with zero wait-states. In fact, the Cortex-M4 block diagram in the i.MX 7 Reference Manual suggests that accesses to the TCM do not even reach the cache controller. Verification measurements also showed that running a firmware from TCM with caches enabled performs exactly the same as with caches disabled.

Firmware from DDR memory while running Linux

The cacheable DDR region is rather unfortunately located when running Linux on the primary cores: the Linux kernel program code by default gets unpacked to 0x80008000, and is not relocated subsequently, hence the area is occupied by the kernel. It is possible to move the text base by adjusting textofs-y in arch/arm/Makefile. Setting it to 0x00208000 puts the kernel two megabytes into DDR memory (see the .text section):

[ 0.000000] Virtual kernel memory layout:
[ 0.000000] vector : 0xffff0000 - 0xffff1000    ( 4 kB)
[ 0.000000] fixmap : 0xffc00000 - 0xfff00000  (3072 kB)
[ 0.000000] vmalloc : 0xa0800000 - 0xff000000 (1512 MB)
[ 0.000000] lowmem : 0x80000000 - 0xa0000000  ( 512 MB)
[ 0.000000] modules : 0x7f000000 - 0x80000000  ( 16 MB)
[ 0.000000]   .text : 0x80208000 - 0x80a7d820 (8663 kB)
[ 0.000000]   .init : 0x80a7e000 - 0x80ace000 ( 320 kB)
[ 0.000000]   .data : 0x80ace000 - 0x80b164c0 ( 290 kB)
[ 0.000000]    .bss : 0x80b19000 - 0x80b7838c ( 381 kB)

The first two megabytes are then free of the kernel and can be reserved for the Cortex-M4 firmware with a reserved-memory node in the device tree:

reserved-memory {
        #address-cells = <1>;
        #size-cells = <1>;
        ranges;

        cortexm4@80000000 {
                reg = <0x80000000 0x200000>;
        };
};

In another test I tried to make use of the area through the contiguous memory allocator (CMA). I had to lower FORCE_MAX_ZONEORDER in arch/arm/Kconfig to 10 to allow a CMA area of just two megabytes (the CMA area has to be aligned to 2^(MAX_ORDER-1) pages, which amounts to 4 MiB with the default order of 11 and 4 KiB pages). I could then reserve the area for the CMA allocator using this device tree entry:

linux,cma {
        compatible = "shared-dma-pool";
        reusable; 
        size = <0x200000>;
        linux,cma-default;
        alloc-ranges = <0x80000000 0x200000>;
};
