This article describes how to use GitHub Actions to scan user-contributed PRs for Unicode and automatically warn you when their commits contain (potentially invisible and malicious) Unicode characters.
Why
Last month the Trojan Source paper was published. It described how malicious Unicode characters can make source code appear benign, yet compile to something quite malicious.
Consider one novel example presented by the Trojan Source paper: using bidirectional (bidi) unicode characters in combination with a comment to do something nasty:
/* begin admins only */ if (isAdmin) {
    printf("Welcome, Admin!\n");
/* end admins only */ }
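The animation above demonstrates an attack in which unterminated bidirectional Unicode characters span comments and code. What a reviewer sees as an if (isAdmin) guard is, to the compiler, part of the comment, so the printf runs for every user.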
While malicious use of bidirectional Unicode characters has been well-known since at least 2009, this is the first publication that acknowledged the threat of unterminated bidi characters spanning between code and comments. And, indeed, the researchers point out that they’ve found numerous cases in the wild of bidi characters used to obfuscate code on GitHub, in at least Ruby and JavaScript.
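Because these characters are invisible in most editors and diff viewers, dumping the raw bytes is a quick way to see them. The following is a minimal sketch, assuming a Bash shell with printf, od and tr available. The sample string and the show_bytes helper are hypothetical and only mirror the comment from the example above.

```shell
#!/usr/bin/env bash
# Build a comment that hides U+202E (RIGHT-TO-LEFT OVERRIDE), which UTF-8
# encodes as the bytes e2 80 ae. Octal escapes keep the source itself ASCII.
sample=$(printf '/*\342\200\256 begin admins only */')

# Hypothetical helper: print a string as a run of hex bytes so that
# invisible characters cannot hide.
show_bytes() {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'
}

show_bytes "$sample"; echo
```

In the output, the bytes e280ae sit directly after 2f2a (the opening /*), even though the comment looks ordinary when rendered.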
Assumptions
I designed this solution for an Open Source project that I maintain on GitHub. Therefore:
- It’s assumed the reader is using GitHub
- My code & comments are written in English (so there are very few places where Unicode characters are “normal”)
- This article was written in 2021
The Solution
Since 2019, GitHub Actions has given open source projects the ability to run CI pipelines on Microsoft’s shared runners for free.
To use GitHub Actions, simply add a yaml file to your GitHub repo’s .github/workflows/ directory, as described here.
In this case, we create a workflow file named unicode_warn.yml.
################################################################################
# File:    .github/workflows/unicode_warn.yml
# Version: 0.1
# Purpose: Detects Unicode in PRs and comments the results of findings in PR
#          * https://tech.michaelaltfield.net/bidi-unicode-github-defense/
# Authors: Michael Altfield <michael@michaelaltfield.net>
# Created: 2021-11-20
# Updated: 2021-11-20
################################################################################
name: malicious_sanity_checks

# execute this workflow automatically on all PRs
on: [pull_request]

jobs:

  unicode_warn:
    runs-on: ubuntu-latest
    container: debian:bullseye-slim

    steps:

      - name: Prereqs
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          apt-get update
          apt-get install -y git bsdmainutils
          git clone "https://token:${GITHUB_TOKEN}@github.com/${GITHUB_REPOSITORY}.git" .
        shell: bash

      - name: Check diff for unicode
        id: unicode_diff
        run: |
          set -x
          diff=`git diff --unified=0 ${{ github.event.pull_request.base.sha }} ${{ github.event.pull_request.head.sha }} | grep -E "^[+]" | grep -Ev '^(--- a/|\+\+\+ b/)'`

          unicode_diff=`echo -n "${diff}" | grep -oP "[^\x00-\x7F]*"`
          unicode_grep_exit_code=$?
          echo "${unicode_diff}"

          unicode_diff_hexdump=`echo -n "${unicode_diff}" | hd`
          echo "${unicode_diff_hexdump}"

          # did we select any unicode characters?
          if [[ "${unicode_diff_hexdump}" == "" ]]; then
            # we didn't find any unicode characters
            human_result="INFO: No unicode characters found in PR's commits"
            echo "${human_result}"
          else
            # we found at least 1 unicode character
            human_result="^^ WARNING: Unicode characters found in diff!"
            echo "${human_result}"
            echo "${diff}"
          fi

          echo "UNICODE_HUMAN_RESULT=${human_result}" >> $GITHUB_ENV
        shell: bash {0}

      # leave a comment on the PR. See also
      #  * https://stackoverflow.com/a/64126737
      # make sure this doesn't open command injection risks
      #  * https://github.com/victoriadrake/github-guestbook/issues/1#issuecomment-657121754
      - name: Leave comment on PR
        uses: actions/github-script@v5
        with:
          github-token: ${{secrets.GITHUB_TOKEN}}
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: "${{ env.UNICODE_HUMAN_RESULT }}"
            })
The above file defines a workflow named malicious_sanity_checks with a job named unicode_warn. This job contains multiple steps that will execute every time a PR is opened or updated in your repo:
- Prereqs – First, basic dependencies like git and hd are installed
- Check diff for unicode – A simple Bash script uses grep to detect non-ASCII characters in a diff across the PR’s commits-to-be-merged
- Leave comment on PR – Adds a comment on the PR indicating whether the commits include Unicode characters or not
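To try the detection logic outside of CI, the middle step can be approximated locally. This is only a sketch under a few assumptions: GNU grep (for -P and -a), od and tr are installed, and the diff is UTF-8. The added_non_ascii helper name is hypothetical and does not appear in the workflow. It reads a unified diff on stdin rather than calling git itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a unified diff on stdin and print, as hex, the
# non-ASCII bytes found on added lines. Prints nothing if there are none.
added_non_ascii() {
  grep -aE '^[+]' \
    | grep -aEv '^(--- a/|\+\+\+ b/)' \
    | LC_ALL=C grep -aoP '[^\x00-\x7F]+' \
    | tr -d '\n' \
    | od -An -v -tx1
}

# Usage (assumes a git checkout that has a branch named "main"):
#   git diff --unified=0 main HEAD | added_non_ascii
```

Forcing LC_ALL=C makes grep treat the input as raw bytes, so the result does not depend on the locale of the machine running it.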
Example
I created an example detect-malicious-unicode repo on GitHub to demonstrate the above GitHub Actions workflow.
You can see this in-action with the two example Pull Requests of the repo.
The first demonstrates a PR attempting to merge malicious bidirectional Unicode characters. The second demonstrates a PR attempting to merge malicious homoglyph Unicode characters.
Improvements
This section outlines some ways that this article’s solution can be improved.
Too broad of unicode match
The solution presented here broadly detects all non-ascii characters. It adds a comment to a PR whenever said PR’s commits include unicode characters, and it logs a hexdump of these characters — alerting a code reviewer to potential issues and permitting them to investigate further.
This could further be improved by narrowing the scope of the characters matched to only detect certain types of potential attacks, such as by detecting only:
- text directionality control characters
- unterminated bidi override characters within string literals and comments
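The first of these narrower checks could look roughly like the sketch below. It assumes UTF-8 input and GNU grep, and it matches only the embedding, override and isolate controls U+202A to U+202E and U+2066 to U+2069. The BIDI_PATTERN variable and has_bidi helper are hypothetical names. It does not attempt to decide whether a control character is terminated.

```shell
#!/usr/bin/env bash
# UTF-8 byte patterns for the bidi control characters:
#   U+202A..U+202E encode as e2 80 aa..ae
#   U+2066..U+2069 encode as e2 81 a6..a9
BIDI_PATTERN='\xE2\x80[\xAA-\xAE]|\xE2\x81[\xA6-\xA9]'

# Hypothetical helper: exit 0 if stdin contains a bidi control character,
# exit 1 otherwise. LC_ALL=C makes grep match raw bytes.
has_bidi() {
  LC_ALL=C grep -aqP "$BIDI_PATTERN"
}

# Usage (assumes a git checkout that has a branch named "main"):
#   git diff --unified=0 main HEAD | has_bidi && echo "bidi controls found"
```

Unlike the broad non-ASCII match, this stays quiet for ordinary accented letters, so it suits projects where some Unicode is expected. It will not catch homoglyph attacks.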
ⓘ Note: In response to Trojan Source, GitHub started displaying warnings in their web UI when viewing source code that contains bidirectional Unicode characters (example).
While this works great for reviewing code containing bidi-type attacks, it does not alert for code containing unicode homoglyph attacks (example).
Non-Failing Action
This workflow could be improved by actually exiting non-zero in a subsequent step (after the comment is made) to make it more obvious to the PR reviewer that the code should not be merged.
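One way to sketch that final step is a small Bash function used as the body of a later run: step. It relies on the UNICODE_HUMAN_RESULT variable that the workflow above writes to $GITHUB_ENV. The fail_on_unicode name is hypothetical, and matching on the word WARNING is an assumption based on the message text used above.

```shell
#!/usr/bin/env bash
# Hypothetical body for a step placed after "Leave comment on PR".
# Returns non-zero when the earlier step recorded a warning.
fail_on_unicode() {
  case "${UNICODE_HUMAN_RESULT:-}" in
    *WARNING*)
      echo "Unicode characters found; failing the job" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}

# Usage inside the step:
#   fail_on_unicode || exit 1
```

Keeping this as the last step means the comment is still posted before the job is marked as failed.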
Further Reading
For more information about Trojan Source and its defenses, see:
- The Trojan Source research paper
- CVE-2021-42574 (bidi)
- CVE-2021-42694 (homoglyph)
- The Trojan Source GitHub repo
Hi, I’m Michael Altfield. I write articles about opsec, privacy, and devops ➡