
Linux integrity monitoring on the Raspberry Pi

This blog post, written by István Telek, is the last in a series of posts on transforming the Raspberry Pi into a security enhanced IoT platform. It describes how we implemented a very basic integrity monitoring function as a trusted application running in OP-TEE.


Runtime integrity monitoring can enhance the security of our system by providing information about possible failures, misconfigurations and break-in attempts. OP-TEE and the ARM TrustZone technology provide a secure way to map certain memory regions and check their contents. In our example, we use this feature to obtain, in the secure world OP-TEE, the list of processes running in the normal world OS. One can then use a whitelisting technique to reliably detect any unknown or unwanted processes running in the normal world OS.

We explain our work in 6 main steps below:

  1. Linux process list structure
  2. Linux kernel memory addressing
  3. Implementation as a Linux kernel module
  4. Mapping Linux kernel memory from OP-TEE
  5. Implementation as a Trusted Application
  6. Creating a client application for our TA

1. Linux process list structure

Linux processes are represented by task_struct structures in kernel memory (see Linux kernel 4.6.3 task_struct structure source code). An instance of this structure is around 3 KiB in size and it contains a lot of information about the particular running process. These structures are stored in the kernel memory cache (kmem_cache).

Key points of the structure:

  • pid_t pid; Contains the PID of the process
  • struct list_head children; Head of a circular doubly linked list holding the children of the given process.
  • struct list_head sibling; Links this process into its parent’s children list.
  • char comm[16]; Contains the name of the process. If a kernel thread’s name ends with a / character followed by a number, the number indicates which CPU/core the thread is running on.

For our purposes, we are using a minimal task_struct structure called my_task_struct which only has these key members:

struct my_list_head {
    struct my_list_head *next, *prev;
};

struct my_task_struct {
  char __padding_01[1056];
  int pid;                      // 'pid_t' is an alias for '__kernel_pid_t' which is 'int' in kernel 4.6.3
  char __padding_02[20];
  struct my_list_head children;
  struct my_list_head sibling;
  char __padding_03[368];
  char comm[16];                // zero terminated string
};

Each task_struct structure contains a link to its first child: its children field points to the child’s sibling field. The last task_struct’s sibling in this chain links back to the parent’s children field. We can get the process list by traversing this task_struct tree starting from the init_task symbol, which is statically allocated in kernel memory at boot. The first child of init_task is init with PID 1, and the second child is kthreadd, the kernel thread daemon, with PID 2. To get the task_struct from a children->next pointer, we have to subtract the offset of the sibling field, which yields a pointer to the first byte of the struct. This can be done with the list_entry macro supplied by the kernel (see the list_entry macro documentation).

2. Linux kernel memory addressing

The virtual address of the root of the process list (the init_task symbol) can be found in the static symbol table which is created during the kernel compilation. On a running system /proc/kallsyms contains these static symbols along with the dynamic symbols. This first task is not a running process but the head of the process tree. Its process id (PID) is 0 and it is called swapper.

The address of init_task (which is 0xffffff80089bad00 in our case) depends on the particular kernel version and configuration. This is a Kernel Virtual Address (more about memory addressing can be found here and here), which is linearly mapped in the physical memory, so it is easy to translate it to a physical address. This translation is implemented as a macro called __virt_to_phys in memory.h:

#define __virt_to_phys(x) ({                                        \
    phys_addr_t __x = (phys_addr_t)(x);                             \
    __x & BIT(VA_BITS - 1) ? (__x & ~PAGE_OFFSET) + PHYS_OFFSET :   \
        (__x - kimage_voffset); })

This macro first checks the type of the kernel virtual address and calculates the offset accordingly. The virtual address of init_task (0xffffff80089bad00) can be translated to physical address 0x00000000009bad00. The physical addresses are important for us, because our Trusted Application needs to map the physical memory regions in OP-TEE to access them.

3. Implementation as a Linux kernel module

We can create a simple kernel module which traverses and prints the process list. This kernel module does not use built-in macros or types, because we are going to implement it as a trusted application in the future. Because of this, we have to define our own list_head and task_struct structures, and a list_entry macro.

To create a kernel module we have to implement the init_module function which is defined in the <linux/module.h> header in the kernel sources. To do this we create a directory called kmod_pslist next to the Linux kernel source directory, which in our case is called linux. This directory is where we put our kernel module source file linux_pslist.c and the following Makefile:

PWD = $(shell pwd)
LINUX_KERNEL = $(PWD)/../linux
CROSS_COMPILE = "$(PWD)/../toolchains/bin/aarch64-linux-gnu-"
obj-m += linux_pslist.o
all:
	$(MAKE) -C $(LINUX_KERNEL) M=$(PWD) ARCH=arm64 CROSS_COMPILE=$(CROSS_COMPILE) modules
kernel:
	$(MAKE) -C $(LINUX_KERNEL) ARCH=arm64 CROSS_COMPILE=$(CROSS_COMPILE) bcmrpi3_defconfig
	$(MAKE) -C $(LINUX_KERNEL) ARCH=arm64 CROSS_COMPILE=$(CROSS_COMPILE)

To print out the process tree we can write a recursive function which traverses the process list. This function prints the name and PID of each task together with the virtual address of the current task_struct structure. The full source code of linux_pslist.c is the following:

#include <linux/module.h>

/*
 * The virtual address of the 'init_task' symbol.
 * Set this to the appropriate address found in System.map or /proc/kallsyms.
 */
#define INIT_TASK UL(0xffffff80089bad00)

/* list_entry can convert a list member to its original type */
#define my_list_entry(ptr, type, member) \
  ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simple linked list structure */
struct my_list_head {
    struct my_list_head *next, *prev;
};

/* Custom 'task_struct' structure with the most basic fields */
struct my_task_struct {
  char __padding_01[1056];
  int pid;                      /**< The process ID */
  char __padding_02[20];
  struct my_list_head children; /**< The first child of this process */
  struct my_list_head sibling;  /**< The parent process' next child */
  char __padding_03[368];
  char comm[16];                /**< The name of the process */
};

/*
 * Recursive DFS for traversing the 'task_struct' tree.
 * Prints out information about the running processes.
 * @param task the root of the process tree
 * @param level the current level of indentation
 */
void print_pslist(struct my_task_struct *task, int level) {
  struct my_list_head *list;
  struct my_list_head *head;
  printk(KERN_INFO "%*s%s (%u) [%p]\n", 2*level, "", task->comm, task->pid, task);
  head = &task->children;
  for (list = head->next; list != head; list = list->next) {
    task = my_list_entry(list, struct my_task_struct, sibling);
    print_pslist(task, level+1);
  }
}

int init_module(void) {
  printk(KERN_INFO "Process tree:\n");
  print_pslist((struct my_task_struct *)(INIT_TASK), 0);
  return 0;
}

To compile the module we need to obtain the ARM64 cross-compilation toolchain from Linaro and extract it into a directory called toolchains. These tools can be downloaded from here.

We also need to download and extract the linux kernel sources into the linux directory (we are using a patched kernel from Linaro in our example):

$ wget
$ tar -xvf gcc-linaro-7.2.1-2017.11-x86_64_aarch64-linux-gnu.tar.xz
$ mv gcc-linaro-7.2.1-2017.11-x86_64_aarch64-linux-gnu toolchains
$ wget
$ unzip
$ mv linux-rpi3-optee-4.6 linux

After obtaining all the sources we have to compile the kernel first. You can speed up the compilation by specifying the number of Make jobs with the -jX parameter. It is recommended to set this to at least the number of threads your processor can run simultaneously.

$ cd kmod_pslist
$ make kernel -j4

To compile the kernel module simply issue the make command in the kmod_pslist directory:

$ make

If the compilation was successful, the kernel module linux_pslist.ko is created in the current directory. We can copy it to the Raspberry Pi and load it to see the process list:

$ sudo insmod linux_pslist.ko
$ dmesg

If the kernel module was loaded successfully a similar message should appear in the kernel log:

[ 8946.124100] Process tree:
[ 8946.129022] swapper/0 (0) [ffffff80089bad00]
[ 8946.135534]   init (1) [ffffffc034ce0000]
[ 8946.216968]   kthreadd (2) [ffffffc034ce0d00]

4. Mapping Linux kernel memory from OP-TEE

To access any memory from OP-TEE, it needs to be mapped. We need to map the area of the kernel memory where the init_task symbol and the other task_struct structures are located. Every mapping must be PAGE_SIZE (usually 4 KiB) aligned. For a more detailed description of memory management in OP-TEE see the Design Documentation. The mappings can be registered with the register_phys_mem_ul macro, which can be called in core_mmu.c or in the platform specific main.c:

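The registrations could look like the following sketch (our reconstruction, using the MAP_KERNEL_* constants defined below; the exact macro signature and call site may differ between OP-TEE versions):

```c
/* Sketch only: register the Linux kernel image and RAM regions as
 * non-secure memory visible to the OP-TEE core. */
register_phys_mem_ul(MEM_AREA_RAM_NSEC, MAP_KERNEL_START, MAP_KERNEL_SIZE);
register_phys_mem_ul(MEM_AREA_RAM_NSEC, MAP_KERNEL_RAM_START, MAP_KERNEL_RAM_SIZE);
```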

We also defined constants for the start addresses and sizes of the areas in platform_config.h:

#define MAP_KERNEL_START        0x009b0000
#define MAP_KERNEL_SIZE         (8 * 1024 * 1024)
#define MAP_KERNEL_RAM_START    0x30000000
#define MAP_KERNEL_RAM_SIZE     (240 * 1024 * 1024)

According to the kernel virtual memory layout (see below), 0xffffff80089bad00 (virt) / 0x009bad00 (phys) falls in the .data section, while 0x30000000 (phys) falls in the memory region.

Memory mapping limitations

In core_mmu_lpae.h the MAX_XLAT_TABLES constant must be set to at least 10; with lower values OP-TEE panics, because the 248 MiB of NW memory cannot be mapped. The core_mmu_entry_to_finer_grained function tries to make the tables “finer grained” (probably by creating more small L1 tables) but fails once it reaches the configured maximum translation (xlat) table limit. Raising the limit solves the problem.

#define MAX_XLAT_TABLES 10

RPi3 memory maps

Below is the physical and kernel virtual memory map of the RPi3.

Physical memory map

Every constant name on the map is from OP-TEE platform_config.h.

                0x4000_0000          DRAM0_SIZE
               |    Device Base                 | UART: 0x3f215040
               |                                | WDT : 0x3f100000
               |0x3f00_0000                     |
               |    NW RAM                      |
               |                                | ~860 MiB usable RAM probably
               |0x0a00_0000 or 0x0a40_0000      |
             ^ +--------------------------------+
             | |    ???                         |
  Secure RAM | |                                | 12 MiB or 8 MiB
      32 MiB | |0x0980_0000                     | unused secure RAM?
   or 28 MiB | +--------------------------------+
             | |    TA RAM                      |
             | |                         16 MiB |
             | |0x0880_0000    CFG_TA_RAM_START |
           ^ | +--------------------------------+
           | | |    OP-TEE Core RAM             |
OP-TEE RAM | | |                                | BL32
     4 MiB | | |0x0842_0000   CFG_TEE_LOAD_ADDR |
           | | +--------------------------------+
           | | |    ARM TF                      |
           | | |                        128 KiB | BL31
           | | |0x0840_0000         TZDRAM_BASE | == CFG_TEE_RAM_START
           v v +--------------------------------+
               |    NS SHM                      |
               |                          4 MiB | Non-secure shared memory
               |0x0800_0000     CFG_SHMEM_START |
             ^ +--------------------------------+
             | |    Linux DTB                   | Linux kernel RAM
             | |                                |
   127.5 MiB | |0x0170_0000                     |
             | +--------------------------------+
             | |    Linux kernel                |
             | | +----------------------------+ |
             | | |    BL30 MCU FW             | | BL30: early tmp buffer for
             | | |                      1 MiB | | MCU firmware, parsed by BL32
             | | |0x0100_0000                 | |
             | | +----------------------------+ |
             | |0x0008_0000                     |
             v +--------------------------------+
               |    U-Boot                      | Stubs + U-Boot,
               |                                | U-Boot self-relocates
               |0x0000_0000          DRAM0_BASE | to high memory

Note: There is a slight inconsistency between the ARM TF platform specific config constants and the OP-TEE config constants: ARM TF defines the secure RAM as 28 MiB (DRAM_SEC_SIZE), while OP-TEE defines it as 32 MiB (TZDRAM_SIZE). There are also 8 or 12 MiB of unused secure RAM, at least according to the definitions in the config files.

Kernel virtual memory map

Virtual kernel memory layout:
    modules : 0xffffff8000000000 - 0xffffff8008000000   (   128 MB)
    vmalloc : 0xffffff8008000000 - 0xffffffbdbfff0000   (   246 GB)
      .text : 0xffffff8008080000 - 0xffffff80086d7000   (  6492 KB)
    .rodata : 0xffffff80086d7000 - 0xffffff8008914000   (  2292 KB)
      .init : 0xffffff8008914000 - 0xffffff80089aa000   (   600 KB)
      .data : 0xffffff80089aa000 - 0xffffff8008a63400   (   741 KB)
    vmemmap : 0xffffffbdc0000000 - 0xffffffbfc0000000   (     8 GB maximum)
              0xffffffbdc0000000 - 0xffffffbdc0ec0000   (    14 MB actual)
    fixed   : 0xffffffbffe7fd000 - 0xffffffbffec00000   (  4108 KB)
    PCI I/O : 0xffffffbffee00000 - 0xffffffbfffe00000   (    16 MB)
    memory  : 0xffffffc000000000 - 0xffffffc03b000000   (   944 MB)

5. Implementation as a Trusted Application

There are two types of Trusted Applications (TA) in OP-TEE. The more common type is the user mode TA: a full featured Trusted Application as specified by the GlobalPlatform TEE specifications. User TAs are loaded into memory from the untrusted NW file system by OP-TEE when called (by specifying their UUID). They can access the GlobalPlatform core API and run at a lower exception level than the OP-TEE core.

The other type is called the pseudo TA. These are implemented directly in the OP-TEE core tree and statically built into the OP-TEE core, and because of this they cannot access the GlobalPlatform core API. They are also called by specifying their UUID and run at the same exception level as the OP-TEE core (like kernel modules in Linux). For a more detailed explanation see the Design Documentation TA section.

The pseudo TA uses the same algorithm as the kernel module. The main difference is that it must be independent of the Linux kernel headers used in the module, and since it is compiled into the OP-TEE core, there is very limited standard I/O and string manipulation support. The only usable printing macros, such as DMSG and IMSG, are defined in trace.h.

The entry point of the TA is static TEE_Result invoke_command, just like in user TAs, but it must be registered with the pseudo_ta_register macro. For example:

pseudo_ta_register(.uuid = PSLIST_UUID, .name = TA_NAME,
           .flags = PTA_DEFAULT_FLAGS,
           .invoke_command_entry_point = invoke_command);

We implemented the Linux kernel macros (defined in memory.h) in the pseudo TA for translating NW virtual addresses to physical ones. The physical addresses must then be converted to SW virtual addresses so they can be used in code. This is achieved by calling phys_to_virt (defined in core_memprot.h) (ref. issue #1496).

/* Translates NW Linux kernel virtual addresses to physical addresses. */
inline static paddr_t NW_V2P(vaddr_t x) {
    return (paddr_t)(x & __BIT(VA_BITS - 1) ? (x & ~PAGE_OFFSET) + PHYS_OFFSET : (x - KIMAGE_VADDR));
}

/* Translates NW Linux kernel virtual addresses to SW OP-TEE virtual addresses. */
inline static vaddr_t V2V(void *x) {
    return (vaddr_t)phys_to_virt(NW_V2P((vaddr_t)x), MEM_AREA_RAM_NSEC);
}

The pseudo TA source files can be placed in core/arch/arm/pta or in core/arch/arm/plat-rpi3 and must be added to sub.mk in those directories to be compiled into the OP-TEE core:

srcs-y += pslist.c

After extending sub.mk, build OP-TEE again and the pseudo TA will be compiled into it.

Full TA source code:

#include <trace.h>
#include <kernel/pseudo_ta.h>
#include <mm/core_memprot.h>

#define TA_NAME     "pslist.ta"

#define PSLIST_UUID \
        { 0x5e286bf0, 0x4494, 0x11e8, \
            { 0x84, 0x2f, 0x0e, 0xd5, 0xf8, 0x9f, 0x71, 0x8b } }

#define INIT_TASK       0xffffff80089bad00

#define SZ_128M         0x08000000
#define __BIT(nr)       (1UL << (nr))
#define VA_BITS         39
#define PAGE_OFFSET     ((0xffffffffffffffff) << (VA_BITS - 1))
#define PHYS_OFFSET     0
#define VA_START        ((0xffffffffffffffff) << VA_BITS)
#define MODULES_VSIZE   (SZ_128M)
#define KIMAGE_VADDR    (VA_START + MODULES_VSIZE) /* == 0xffffff8008000000 */

/* Translates NW Linux kernel virtual addresses to physical addresses. */
inline static paddr_t NW_V2P(vaddr_t x) {
    return (paddr_t)(x & __BIT(VA_BITS - 1) ? (x & ~PAGE_OFFSET) + PHYS_OFFSET : (x - KIMAGE_VADDR));
}

/* Translates NW Linux kernel virtual addresses to SW OP-TEE virtual addresses. */
inline static vaddr_t V2V(void *x) {
    return (vaddr_t)phys_to_virt(NW_V2P((vaddr_t)x), MEM_AREA_RAM_NSEC);
}

#define list_entry(ptr, type, member) ((type *)((char *)(ptr) - offsetof(type, member)))

struct my_list_head {
    struct my_list_head *next, *prev;
};

struct my_task_struct {
  char __padding_01[1056];
  unsigned int pid;
  char __padding_02[20];
  struct my_list_head children;
  char __padding_03[384];
  char comm[16];
};

static inline void print_task(struct my_task_struct *task, int level) {
  char buf[256];
  for (int i = 0; i < level; ++i) {
    buf[i] = ' ';
  }
  buf[level] = '\0';
  IMSG("%s %s (%u) [%p]\n", buf, task->comm, task->pid, (void*)virt_to_phys(task));
}

void traverse_list(struct my_list_head *head, int level);

static void print_tasks(struct my_task_struct *task, int level) {
  print_task(task, level);
  traverse_list(&task->children, level);
}

void traverse_list(struct my_list_head *head, int level) {
  struct my_task_struct *task;
  struct my_list_head *list;
  for (list = (struct my_list_head *)V2V(head->prev); list != head; list = (struct my_list_head *)V2V(list->prev)) {
    task = list_entry(list, struct my_task_struct, __padding_03);
    print_tasks(task, level+1);
  }
}

static TEE_Result invoke_command(void *psess __unused,
                 uint32_t cmd __unused, uint32_t ptypes __unused,
                 TEE_Param params[TEE_NUM_PARAMS] __unused)
{
    print_tasks((struct my_task_struct *)(V2V((void*)INIT_TASK)), 0);
    return TEE_SUCCESS;
}

pseudo_ta_register(.uuid = PSLIST_UUID, .name = TA_NAME,
           .flags = PTA_DEFAULT_FLAGS,
           .invoke_command_entry_point = invoke_command);

6. Creating a client application for our TA

In the client application, the pseudo TA is called by its UUID; the TA prints its output on the UART:

#define TA_PSLIST_UUID \
    { 0x5e286bf0, 0x4494, 0x11e8, \
            { 0x84, 0x2f, 0x0e, 0xd5, 0xf8, 0x9f, 0x71, 0x8b } }
// ...
res = TEEC_InvokeCommand(&sess, 0, &op, &err_origin);

To build the client app, see the OP-TEE Examples repo. To install it, simply copy it to the root file system, e.g. into /usr/bin.

Full client source code:

#include <err.h>
#include <stdio.h>
#include <string.h>

/* OP-TEE TEE client API (built by optee_client) */
#include <tee_client_api.h>

#define TA_PSLIST_UUID \
    { 0x5e286bf0, 0x4494, 0x11e8, \
            { 0x84, 0x2f, 0x0e, 0xd5, 0xf8, 0x9f, 0x71, 0x8b } }

int main(int argc, char *argv[])
{
    TEEC_Result res;
    TEEC_Context ctx;
    TEEC_Session sess;
    TEEC_Operation op;
    TEEC_UUID uuid = TA_PSLIST_UUID;
    uint32_t err_origin;

    /* Initialize a context connecting us to the TEE */
    res = TEEC_InitializeContext(NULL, &ctx);
    if (res != TEEC_SUCCESS)
        errx(1, "TEEC_InitializeContext failed with code 0x%x", res);

    /* Open a session to the "pslist" TA */
    res = TEEC_OpenSession(&ctx, &sess, &uuid,
                   TEEC_LOGIN_PUBLIC, NULL, NULL, &err_origin);
    if (res != TEEC_SUCCESS)
        errx(1, "TEEC_Opensession failed with code 0x%x origin 0x%x",
            res, err_origin);

    /*
     * Execute a function in the TA by invoking it. The value of the
     * command ID and how the parameters are interpreted is part of the
     * interface provided by the TA.
     */

    /* Clear the TEEC_Operation struct */
    memset(&op, 0, sizeof(op));
    op.paramTypes = TEEC_PARAM_TYPES(TEEC_NONE, TEEC_NONE,
                     TEEC_NONE, TEEC_NONE);

    /* Calling the TA */
    printf("Invoking TA pslist\n");
    res = TEEC_InvokeCommand(&sess, 0, &op, &err_origin);
    if (res != TEEC_SUCCESS)
        errx(1, "TEEC_InvokeCommand failed with code 0x%x origin 0x%x",
            res, err_origin);
    printf("TA printed pslist\n");

    /*
     * We're done with the TA, close the session and
     * destroy the context.
     */
    TEEC_CloseSession(&sess);
    TEEC_FinalizeContext(&ctx);

    return 0;
}

Further issues and problems encountered

Securing the RPi3 hardware watchdog timer (unsuccessfully)

The Raspberry Pi 3 has a hardware watchdog timer (WDT) in the BCM2837 SoC. We wanted to use this timer securely from OP-TEE in such a way that only the Secure World (SW) can access the WDT. By doing this, periodical execution of a Trusted Application (e.g., our integrity monitoring app) could be ensured. Unfortunately, we could not achieve this with the RPi3 board. The device peripherals are mapped in high memory, and there is currently no known method to secure parts of that memory region without potentially causing problems for the Normal World operations and the Linux kernel. The lack of security features of the board makes this process generally harder.

Lack of information in general

The BCM2837 does not have publicly available documentation. The official page references the BCM2835 and BCM2836 SoCs used in the previous generation Pis. These two documents contain only minimal information about the WDT: the BCM2836 document mentions that one of the timers on the SoC can be used as a watchdog timer, but provides no further information.


The datasheets mention three types of addresses: physical, virtual and bus. The addresses in the documents are bus addresses and therefore cannot be used directly in programs. Also, the physical address mapping in the documents is different from the real physical address mapping of the RPi3. This can add another layer of confusion when working with peripheral addresses (e.g. in code).

Watchdog reset cycle

Another possible problem is that the WDT on this board does not fully reset the device (e.g. it does not re-execute bootcode.bin, the first stage bootloader running on the GPU) the way a full power-cycle reset does. Some information about this can be found in the Raspberry Pi Forums and in comments in the Linux driver of the WDT.

Memory regions

The physical address of the device peripherals can be found in the ARM-TF RPi3 platform specific source code (contributed by Sequitur Labs): 0x3f000000. This extends to the end of the physical memory (0x40000000).
Every peripheral offered by the SoC is mapped in this region (WDT and other timer registers, GPIO, UART, SPI, I2C…).

The WDT physical address is also defined in the source code (and not documented anywhere else): 0x3f100000. (The used C struct is also available from the Linux driver)

This memory region (0x3f000000 - 0x40000000) is non-secure – accessible by the Normal World too.
ARM-TF configures the memory separation of the NW and SW, because it executes in a higher Exception Level (EL3) than OP-TEE. Only one region of the memory can be made secure in this way.

Possible solutions

  • Remapping OP-TEE: One solution for securing the peripheral region could be to remap the OP-TEE secure region to include the peripheral region too. This is probably not recommended, since every other device used and required by the NW Linux kernel is mapped there as well, and doing this could lead to hard-to-identify problems.
  • ARM TZPC: The best solution is to use the ARM TZPC (TrustZone Protection Controller). With the TZPC, 24 separate memory regions can be secured, i.e. configured to be accessible only by the SW. Unfortunately the RPi3 lacks this peripheral component. Confirmation of this can be found in LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3 (page 7).
  • External WDT: Since the GPIO also cannot be made secure, this solution could only add the possibility of a full power-cycle reset for the board, if that is required. For a discussion on this refer to this RPi forum thread.
  • ARM-TF TSP (Test Secure Payload) secure timers: Instead of using the WDT, this approach uses the ARM secure timers to periodically execute TA code. There is an issue for this on the OP-TEE GitHub page, but no further information for the RPi board, or on the status or success of the approach.
    In this ARM Community question there are instructions for another board. Right now this approach seems like a long shot.

Note: Other publicly known solutions for securing an arbitrary memory region probably do not exist. Refer to this OP-TEE GitHub issue.

Other resources
