Michael Altfield's gravatar

Detecting (Malicious) Unicode in GitHub PRs

This article will describe how you can utilize GitHub Actions to scan user-contributed PRs for unicode and automatically warn you if such commits contain (potentially invisible & malicious) unicode characters.

Detecting Malicious Unicode in GitHub Pull Requests

Why

Last month Trojan Source was published — which described how malicious unicode characters could make source code appear benign, yet compile to something quite malicious.

Consider one novel example presented by the Trojan Source paper: using bidirectional (bidi) unicode characters in combination with a comment to do something nasty:

/*  begin admins only */  if (isAdmin) { 
  printf( “Welcome, Admin!\n” );

The animation above demonstrates an attack when unterminated bidirectional unicode characters span across comments and code.

While malicious use of bidirectional unicode characters has been well-known since at least 2009, this is the first publication that acknowledged the threat of unterminated bidi characters spanning between code and comments. And, indeed, the researches point-out that they’ve found numerous cases of bidi characters used to obfuscate code on GitHub spanning at least Ruby and JavaScript in the wild.

Assumptions

I designed this solution for an Open Source project that I maintain on GitHub. Therefore:

  1. It’s assumed the reader is using GitHub
  2. My code & comments are written in English (so there’s very few places where unicode characters are “normal”)
  3. This article was written in 2021

The Solution

Since 2019, GitHub Actions gave open source projects the ability to run CI pipelines on Microsoft’s shared runners for free.

To use GitHub Actions, simply add a yaml file to your GitHub repo’s .github/workflows/ directory, as described here.

In this case, we create a workflow file named unicode_warn.yml.

################################################################################
# File:    .github/workflows/unicode_warn.yml
# Version: 0.1
# Purpose: Detects Unicode in PRs and comments the results of findings in PR
#           * https://tech.michaelaltfield.net/bidi-unicode-github-defense/
# Authors: Michael Altfield <michael@michaelaltfield.net>
# Created: 2021-11-20
# Updated: 2021-11-20
################################################################################
name: malicious_sanity_checks

# execute this workflow automatically on all PRs
on: [pull_request]

jobs:

  unicode_warn:

    runs-on: ubuntu-latest
    container: debian:bullseye-slim

    steps:

    - name: Prereqs
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      run: |
        apt-get update
        apt-get install -y git bsdmainutils
        git clone "https://token:${GITHUB_TOKEN}@github.com/${GITHUB_REPOSITORY}.git" .
      shell: bash

    - name: Check diff for unicode
      id: unicode_diff
      run: |
        set -x
        diff=`git diff --unified=0 ${{ github.event.pull_request.base.sha }} ${{ github.event.pull_request.head.sha }} | grep -E "^[+]" | grep -Ev '^(--- a/|\+\+\+ b/)'`
        unicode_diff=`echo -n "${diff}" | grep -oP "[^\x00-\x7F]*"`
        unicode_grep_exit_code=$?
        echo "${unicode_diff}"

        unicode_diff_hexdump=`echo -n "${unicode_diff}" | hd`
        echo "${unicode_diff_hexdump}"

        # did we select any unicode characters?
        if [[ "${unicode_diff_hexdump}" == "" ]]; then
          # we didn't find any unicode characters
          human_result="INFO: No unicode characters found in PR's commits"
          echo "${human_result}"

        else
          # we found at least 1 unicode character
          human_result="^^ WARNING: Unicode characters found in diff!"
          echo "${human_result}"
          echo "${diff}"

        fi

        echo "UNICODE_HUMAN_RESULT=${human_result}" >> $GITHUB_ENV

      shell: bash {0}

    # leave a comment on the PR. See also
    #  * https://stackoverflow.com/a/64126737
    # make sure this doesn't open command injection risks
    #  * https://github.com/victoriadrake/github-guestbook/issues/1#issuecomment-657121754
    - name: Leave comment on PR
      uses: actions/github-script@v5
      with:
        github-token: ${{secrets.GITHUB_TOKEN}}
        script: |
          github.rest.issues.createComment({
            issue_number: context.issue.number,
            owner: context.repo.owner,
            repo: context.repo.repo,
            body: "${{ env.UNICODE_HUMAN_RESULT }}"
          })
⚠ UPDATE 2024-08-08: The simple example workflow above is outdated. If you encounter errors, try using the latest version in github:

The above file defines a workflow named malicious_sanity_checks with job named unicode_warn. This job contains multiple steps that will execute every time a new PR is created in your repo:

  1. Prereqs – First, basic depends like git and hd are installed
  2. Check diff for unicode – A simple BASH script uses grep to detect non-ascii characters in a diff across the PR’s commits-to-be-merged
  3. Leave comment on PR – Adds a comment on the PR indicating if the commits include unicode characters or not

Example

I created an example detect-malicious-unicode repo on GitHub to demonstrate the above GitHub Actions workflow.

screenshot showing a comment on a PR made by "github-actions bot" indicating that the PR containts unicode characters in diff

PR request with malicious bidirectional unicode is detected by “GitHub Actions Bot”

screenshot showing a comment on a PR made by "github-actions bot" indicating that the PR containts unicode characters in diff

PR request with malicious homoglypgh unicode is detected by “GitHub Actions Bot”

You can see this in-action with the two example Pull Requests of the repo.

  1. Malicious Bidi PR
  2. Malicious Homoglyph PR

The first demonstrates a PR attempting to merge malicious bidirectional unicode characters. The second demonstrates a PR attempting to merge malicious homoglypgh unicode characters.

Improvements

This section will outline some ways that this article’s solution can be improved

Too broad of unicode match

The solution presented here broadly detects all non-ascii characters. It adds a comment to a PR whenever said PR’s commits include unicode characters, and it logs a hexdump of these characters — alerting a code reviewer to potential issues and permitting them to investigate further.

This could further be improved by narrowing the scope of the characters matched to only detect certain types of potential attacks, such as by detecting only:

  1. text directionality control characters
  2. unterminaed bidi override characters within string literals and comments

ⓘ Note: In response to Trojan Source, GitHub started displaying warnings when viewing source code that contains bidirectional unicode characters (example) in their WUI.

While this works great for reviewing code containing bidi-type attacks, it does not alert for code containing unicode homoglyph attacks (example).

Non-Failing Action

This workflow could be improved by actually exiting non-zero in a subsequent step (after the comment is made) to make it more obvious to the PR reviewer that the code should not be merged.

Further Reading

For more information about Trojan Source and its defenses, see:

  1. The Trojan Source research paper
  2. CVE-2021-42574 (bidi)
  3. CVE-2021-42694 (homoglyph)
  4. The Trojan Source GitHub repo

Related Posts

Leave a Reply

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>