Covert channels are real

2021-08-07 [ security vpn tap networking ]

At work, we use a remote-desktop based VPN as a measure of "security" -- the thought process behind it is "real network access from unauthorized devices would be bad". However, as I show in this proof of concept, this only limits those who are not really invested in getting access.

Limitations

My PC on the remote end does not have direct access to the internet: no direct outbound TCP/UDP/ICMP connections are allowed.

Ways to reach the outside world

HTTP via an HTTP proxy that has a set of whitelisted sites.
DNS via a local resolver, for the necessary HTTP traffic. Resolution is not limited to whitelisted domains.
HTTP via another remote-rendering layer: Menlo. Surely TWO layers of security theater are better than one?

Basic principle

A covert channel is a way to transfer data over some mechanism that was never meant to carry it.
Examples range from tunneling one protocol over another to using heat output / CPU load as a way to emit data.

Any mechanism that allows the two sides to send data to each other can be coerced into a network connection.

IP over DNS

This first item was easy and simple: since DNS queries reach the external world without being filtered by a whitelist, we can use the DNS request-reply protocol to set up a tunnel.

This has been done extensively in the past; I simply used iodine, which worked just fine.
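As a rough illustration of the principle (this is not iodine's actual wire format), the payload can ride in the labels of the queried name, under a domain whose authoritative server you control. The domain `t.example.com` and both function names below are made up for the sketch; the only hard facts used are the DNS limits of 63 bytes per label and 253 characters per name.

```python
import base64

MAX_LABEL = 63   # longest single DNS label
MAX_NAME = 253   # longest full domain name in text form


def encode_query_name(payload, tunnel_domain):
    """Pack payload bytes into the labels of a DNS query name."""
    # Base32 survives DNS case-insensitivity; '=' padding is not valid in a
    # hostname, so strip it and restore it when decoding.
    text = base64.b32encode(payload).decode("ascii").rstrip("=").lower()
    labels = [text[i:i + MAX_LABEL] for i in range(0, len(text), MAX_LABEL)]
    name = ".".join(labels + [tunnel_domain])
    if len(name) > MAX_NAME:
        raise ValueError("payload too large for a single query name")
    return name


def decode_query_name(name, tunnel_domain):
    """Recover the payload on the (hypothetical) authoritative server."""
    suffix = "." + tunnel_domain
    if not name.endswith(suffix):
        raise ValueError("query is not for the tunnel domain")
    text = name[:-len(suffix)].replace(".", "").upper()
    text += "=" * (-len(text) % 8)  # restore the stripped padding
    return base64.b32decode(text)
```

The reply direction works the same way in principle, with the data carried in the answer records instead of the name; a real tool also has to deal with caching, fragmentation and resolver quirks, none of which is shown here.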

Finding usable covert channels

There are a few conveniences granted to users which can be exploited to send crafted data, notably:

Bidirectional clipboard support
Bidirectional audio support
Ability to use a keyboard and mouse
Ability to see the remote screen
Ability to use a webcam

However, there are limitations to these channels:

The bidirectional clipboard support is wonky: the remote side will receive updates up to once every 100ms (iff the window is focused!), but the local side only gets updates when the VPN window loses focus
The bidirectional audio support is wonky: playing audio has approximately 1.5s of delay and the audio-in (microphone) seems affected by extreme underruns; popping and saturation happen constantly
- And the loveliest part of all: my VM gets out of sync with the audio devices if I reconnect to the VPN, so I have to restart the VM every time I reconnect
Using a keyboard and mouse, while the primary function of the VPN, is also wonky. Typing too fast will cause the remote side to lose track of KEYUP events and keys will get "stuck".
The screen is transferred over H264 at decent bitrate, but clearly whatever is encoding it skips frames regularly.
The webcam support is extremely wonky. Colors are shifted, image is delayed and framerate consistently stays at a choppy ~8 fps.

Taking these limitations into consideration, here's a summary of the possible channels:

Clipboard only towards the VPN
Audio only from the VPN
Video only from the VPN

With better audio/video filtering (or less buggy implementations from the VPN) we could also feed audio and video into the VPN.

I'm pretty much convinced this bugginess is due to bad implementations, but maaaaaaaybe these are "security" measures?

Getting data in the VPN

As a proof of concept, I constantly updated the clipboard to get a rough idea of its capabilities:

# Push an incrementing counter into the clipboard every 125ms
for i in $(seq 1 1000); do
    printf '%s' "$i" | xclip -i -selection clipboard
    sleep 0.125
done

The clipboard misses some events if the delay is less than 125ms, and at these rates packet loss hurts more than slightly higher latency does.

The clipboard seems to easily gobble 100KB. This is a reliable, bit-perfect way of getting data in, so I will not bother trying to use the keyboard/mouse/microphone/webcam feed for now.

Data over Audio

Ideally, I'd be able to get pure analog output from the remote connection, but this is not the case: the audio is compressed fairly heavily, and that compression is not without skips/pops.

The easiest way to encode data into audio is to use minimodem, although the data rates are abysmal. A regular modem going over crappy copper should be able to do 28-56kbps, but with minimodem over RDP (and no "training"!) and 1500ms latency the data rate is FIXME.

Possible improvements:

Stereo audio for twice the bitrate!
Figure out why the latency is so bad
Better encoding scheme (tried amodem but it kept crashing)
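For a feel of what "data over audio" means at the lowest level, here is a toy binary FSK modulator and demodulator: each bit becomes a short burst of one of two tones, and the receiver checks which tone carries more energy in each bit window. This is not minimodem's implementation; the sample rate, baud rate and tone frequencies are arbitrary picks for the sketch, and there is no clock recovery, framing or tolerance for the skips and pops described above.

```python
import math

SAMPLE_RATE = 48000          # samples per second (assumed)
BAUD = 300                   # bits per second (assumed)
SPACE_HZ = 1200.0            # tone for a 0 bit (assumed)
MARK_HZ = 2400.0             # tone for a 1 bit (assumed)
SAMPLES_PER_BIT = SAMPLE_RATE // BAUD


def modulate(data):
    """Turn bytes into a list of float samples, LSB of each byte first."""
    samples = []
    phase = 0.0
    for byte in data:
        for i in range(8):
            freq = MARK_HZ if (byte >> i) & 1 else SPACE_HZ
            step = 2 * math.pi * freq / SAMPLE_RATE
            for _ in range(SAMPLES_PER_BIT):
                samples.append(math.sin(phase))
                phase += step  # keep the phase continuous across bits
    return samples


def tone_energy(window, freq):
    """Energy of `window` at `freq`, via correlation with sine and cosine."""
    w = 2 * math.pi * freq / SAMPLE_RATE
    re = sum(s * math.cos(w * n) for n, s in enumerate(window))
    im = sum(s * math.sin(w * n) for n, s in enumerate(window))
    return re * re + im * im


def demodulate(samples):
    """Recover bytes, assuming the samples start exactly on a bit boundary."""
    bits = []
    for start in range(0, len(samples) - SAMPLES_PER_BIT + 1, SAMPLES_PER_BIT):
        window = samples[start:start + SAMPLES_PER_BIT]
        bits.append(1 if tone_energy(window, MARK_HZ) > tone_energy(window, SPACE_HZ) else 0)
    out = bytearray()
    for i in range(0, len(bits) - 7, 8):
        out.append(sum(bit << k for k, bit in enumerate(bits[i:i + 8])))
    return bytes(out)
```

At these settings the raw rate is 300 bits per second before any framing, which gives an idea of why plain FSK over a lossy, compressed audio path ends up so slow.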

Data over video

With the basic function of the VPN being "showing the user his desktop" I can use this to get data out via some type of encoding.

I picked QR codes mostly because there are libraries available; as far as I'm aware, a data matrix is a better way to encode this.

On top of that inefficiency, I couldn't get the decoder library I chose to decode binary data (it is currently not supported), so I'm encoding the data with Ascii85, which adds 25% overhead.
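The 25% figure falls out of the encoding itself: Ascii85 maps every 4 input bytes to 5 printable characters. A quick check with Python's standard `base64.a85encode` (the sample payload is arbitrary):

```python
import base64

payload = bytes(range(256)) * 4        # 1024 bytes of arbitrary binary data
encoded = base64.a85encode(payload)    # printable text, safe to put in a QR code

# 4 bytes in -> 5 characters out, so the size grows by a quarter
overhead = len(encoded) / len(payload) - 1

# Decoding gives the original bytes back
assert base64.a85decode(encoded) == payload
```

One caveat: `a85encode` folds any aligned group of four zero bytes into a single `z`, so payloads with long zero runs come out slightly smaller than the 5/4 ratio suggests.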

The pyzbar library is fully written in Python, which usually means it is "slow" (at least for this unexpected use-case) -- codes containing ~1KB of payload were taking up to 500ms to render. I gave PyPy a shot and it took the runtime for those big generations down to ~75ms. Still bad, but good enough.

Limitations:

Framerate is not super high and it clearly skips when there are a lot of changes
Capture is not synchronized with printing, so out-of-sync capture = packet loss

Possible improvements:

Color usage? Might get obliterated by H264
Running the remote desktop at a lower resolution? Might increase refresh rate / lessen frame skips.
Data matrix instead of a QR code? Couldn't find anything conclusive on "density" of each

Piping data as a network connection

Linux has a very handy feature: user-mode networking via TUN/TAP interfaces.

They are great and super easy to use -- after you run a few ioctl calls you get a file descriptor on which you can just read and write and it will be sent over the network!

The main distinction between TUN and TAP (that I know of) is that TAP is a Layer-2 interface: everything you send is wrapped in an Ethernet frame, so if you want to communicate via IP, you pay the Ethernet frame overhead as well. Because the interface is L2, it also needs destination MAC addresses, which it should get from an ARP request/reply, but I couldn't get the remote side to reply to the ARP requests it got (either way).

A TUN interface is purely L3 -- it only sends raw/bare IP packets, less overhead, no ARP, great!
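For reference, opening a TUN device from Python looks roughly like the sketch below. The constants come from `<linux/if_tun.h>`; the `TUNSETIFF` value shown is the one used on x86 and ARM (a few architectures encode ioctl numbers differently). The interface name `tun0` and the helper names are placeholders, and the call needs root or `CAP_NET_ADMIN`.

```python
import fcntl
import os
import struct

TUNSETIFF = 0x400454CA   # ioctl request number (x86/ARM encoding)
IFF_TUN = 0x0001         # L3 device: raw IP packets
IFF_TAP = 0x0002         # L2 device: Ethernet frames
IFF_NO_PI = 0x1000       # do not prepend the 4-byte packet-info header


def build_ifreq(name, flags):
    """Pack the fields of struct ifreq that TUNSETIFF reads: name, then flags."""
    # 16-byte interface name, 16-bit flags, padding up to 40 bytes
    return struct.pack("16sH22x", name.encode("ascii"), flags)


def open_tun(name="tun0"):
    """Open /dev/net/tun and bind the descriptor to interface `name`."""
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name, IFF_TUN | IFF_NO_PI))
    except OSError:
        os.close(fd)
        raise
    return fd
```

After that, the interface still needs an address and to be brought up (for example with `ip addr add` and `ip link set ... up`); from then on each `os.read` on the descriptor returns one outgoing IP packet and each `os.write` injects one incoming packet.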

Implementation

The implementation is fairly straightforward:

Open a TUN network device
Read data from the input covert channel and write it to the TUN (outgoing data)
Read data from the TUN and write it to the output covert channel (incoming data)
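The three steps above boil down to a small pump loop. The sketch below is one possible shape for it, not the code used for the demos: `channel_recv` and `channel_send` are hypothetical callables standing in for whichever covert channel is in use (clipboard, audio, QR), and `tun_fd` is an already opened TUN descriptor.

```python
import os
import select

READ_SIZE = 2048  # comfortably larger than one packet at a typical MTU


def pump_once(tun_fd, channel_recv, channel_send, timeout=0.05):
    """Move at most one packet in each direction; return how many were moved."""
    moved = 0

    # TUN -> covert channel: wait briefly for a packet from the kernel
    readable, _, _ = select.select([tun_fd], [], [], timeout)
    if readable:
        channel_send(os.read(tun_fd, READ_SIZE))
        moved += 1

    # Covert channel -> TUN: channel_recv returns bytes, or None if idle
    packet = channel_recv()
    if packet:
        os.write(tun_fd, packet)
        moved += 1

    return moved
```

Calling `pump_once` in a `while True:` loop on each side gives the basic tunnel; the short `select` timeout keeps the loop polling the covert channel even when the TUN side is quiet.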

Specifically the listed cases are:

Audio
- VPN IN: clipboard -> TUN
- VPN OUT: TUN -> audio card
- Desktop IN: audio card -> TUN
- Desktop OUT: TUN -> clipboard
Video
- VPN IN: clipboard -> TUN
- VPN OUT: TUN -> screen (print QR code)
- Desktop IN: screen (capture) -> TUN
- Desktop OUT: TUN -> clipboard

Everything I implemented here was done badly, in a single afternoon. There are no robustness mechanisms, everything is up to the upper layers, even though some basic schemes could improve the usefulness of this dramatically.

There's no acknowledgement of messages received, batching, anything. The only thing that's there is a "sequence number" simply to check for "uniqueness" in the clipboard, as it can't be cleared (no ACKing of messages, remember?!).
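The sequence-number trick can be sketched as follows. The exact format used in the proof of concept is not shown in this post, so the 4-byte header, the base64 text encoding and all names here are assumptions made for illustration.

```python
import base64
import struct


def frame(seq, packet):
    """Prefix the packet with a 32-bit sequence number, as clipboard-safe text."""
    header = struct.pack("!I", seq & 0xFFFFFFFF)
    return base64.b64encode(header + packet).decode("ascii")


def unframe(text):
    """Inverse of frame(); raises on anything that is not a valid frame."""
    raw = base64.b64decode(text, validate=True)
    (seq,) = struct.unpack("!I", raw[:4])
    return seq, raw[4:]


class Deduplicator:
    """Drops clipboard contents that were already delivered."""

    def __init__(self):
        self.last_seq = None

    def accept(self, text):
        """Return the packet for a new frame, or None for repeats and junk."""
        try:
            seq, packet = unframe(text)
        except (ValueError, struct.error):
            return None  # ordinary clipboard text, not one of our frames
        if seq == self.last_seq:
            return None  # same frame still sitting in the clipboard
        self.last_seq = seq
        return packet
```

Because the clipboard cannot be cleared, the receiver keeps seeing the last frame on every poll; comparing sequence numbers is what lets two identical packets in a row (say, retransmissions) still get through as distinct messages.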

Demos

Everything in the video is running inside the VPN, the traffic shown is all going to my machine via the covert channels.

CLIP IN: 2.47MiB 0:01:00 = 42KB/s (via nc so the bottleneck is still QR parsing for ACKs)
QR OUT: 359KiB 0:01:00 = 6KB/s -> parsing QR takes 70-175ms (including capture)

AUDIO OUT: 9.48KiB 0:01:00 = 0.15KB/s (!)

Ping over audio <-> clipboard (be careful with your sound level)

Ping over QR code <-> clipboard

Mosh over QR code <-> clipboard

Things to investigate

With latency being at least 100ms one-way (clipboard) and ~33-100ms the other way (33ms for a frame at 30FPS, with an upper threshold of 100ms for the capture), acknowledging messages would improve everything, as would aggregating queued packets (the trivial implementation does not aggregate).
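Aggregation could be as simple as length-prefixing each queued packet and filling one clipboard update or QR code with as many as fit. This is a sketch of that idea only, with a made-up 2-byte length prefix; nothing like it exists in the current implementation.

```python
import struct


def pack_batch(packets, max_size):
    """Greedily pack queued packets into one frame; return (frame, leftover)."""
    out = bytearray()
    used = 0
    for packet in packets:
        need = 2 + len(packet)  # 2-byte length prefix plus the payload
        if len(out) + need > max_size:
            break
        out += struct.pack("!H", len(packet)) + packet
        used += 1
    return bytes(out), list(packets[used:])


def unpack_batch(frame):
    """Split a frame produced by pack_batch back into individual packets."""
    packets = []
    offset = 0
    while offset + 2 <= len(frame):
        (length,) = struct.unpack_from("!H", frame, offset)
        offset += 2
        packets.append(frame[offset:offset + length])
        offset += length
    return packets
```

With a fixed cost of roughly 100ms per channel update, sending several small packets per update amortizes that cost. Note that a single packet larger than `max_size` would never be sent by this sketch, so the TUN MTU has to be chosen to fit.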
