At work, we use a remote-desktop based VPN as a measure of "security" -- the thought process behind it is "real network access from unauthorized devices would be bad". However, as I show in this proof of concept, this only stops people who are not really invested in getting access.
My PC on the remote end does not have direct access to the internet: no direct outbound TCP/UDP/ICMP connections are allowed.
Ways to reach the outside world
- HTTP via an HTTP proxy that has a set of whitelisted sites.
- DNS via a local resolver, for the necessary HTTP traffic. Resolution is not limited to whitelisted domains.
- HTTP via another remote-rendering layer: Menlo. Surely TWO layers of security theater are better than one?
A covert channel is a way of transferring data over a mechanism that was never intended to carry it.
Examples range from tunneling one protocol over another to using heat output / CPU load as a way to emit data.
Any mechanism that allows the two sides to send data to each other can be coerced into a network connection.
IP over DNS
This first one was easy: since DNS queries reach the external world without being filtered by a whitelist, we can use the DNS request-reply protocol to set up a tunnel.
This has been done extensively in the past; I simply used iodine, which worked just fine.
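To make the idea concrete, here is a minimal sketch of how a payload can ride inside a DNS query name. This is not iodine's actual wire format; the base32 encoding, the helper names and the `t.example.com` domain are all assumptions for illustration. The only hard facts used are the DNS limits of 63 bytes per label and 253 characters per name.

```python
import base64

MAX_LABEL = 63   # DNS limit: bytes per label
MAX_NAME = 253   # DNS limit: characters in a full name

def encode_query(payload: bytes, domain: str) -> str:
    """Pack payload into the labels of a query name under `domain` (hypothetical scheme)."""
    if not payload:
        raise ValueError("nothing to send")
    # base32 is case-insensitive, so it survives resolvers that change case
    text = base64.b32encode(payload).decode("ascii").rstrip("=").lower()
    labels = [text[i:i + MAX_LABEL] for i in range(0, len(text), MAX_LABEL)]
    name = ".".join(labels + [domain])
    if len(name) > MAX_NAME:
        raise ValueError("payload too large for a single DNS name")
    return name

def decode_query(name: str, domain: str) -> bytes:
    """Reverse of encode_query, as the authoritative server for `domain` would do."""
    suffix = "." + domain
    if not name.endswith(suffix):
        raise ValueError("query is not for our tunnel domain")
    text = name[:-len(suffix)].replace(".", "").upper()
    text += "=" * (-len(text) % 8)  # restore the padding stripped by the encoder
    return base64.b32decode(text)
```

The reply direction would carry data in the record contents (TXT, NULL, ...), which this sketch leaves out.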
Finding usable covert channels
There are a few conveniences granted to users which can be exploited to send crafted data, notably:
- Bidirectional clipboard support
- Bidirectional audio support
- Ability to use a keyboard and mouse
- Ability to see the remote screen
- Ability to use a webcam
However, there are limitations to these channels:
- The bidirectional clipboard support is wonky: the remote side receives updates at most once every 100ms (and only if the window is focused!), but the local side only gets updates when the VPN window loses focus
- The bidirectional audio support is wonky: playing audio has approximately 1.5s of delay, and the audio-in (microphone) seems affected by extreme underruns; popping and saturation happen constantly
- And the most lovely part of all: my VM gets out of sync with the audio devices if I reconnect to the VPN, so I have to restart the VM every time I reconnect
- Using a keyboard and mouse, while the primary function of the VPN, is also wonky. Typing too fast will cause the remote side to lose track of KEYUP events and keys will get "stuck".
- The screen is transferred over H264 at a decent bitrate, but whatever is encoding it clearly skips frames regularly.
- The webcam support is extremely wonky. Colors are shifted, image is delayed and framerate consistently stays at a choppy ~8 fps.
Taking these limitations into consideration, here's a summary of the usable channels:
- Clipboard only towards the VPN
- Audio only from the VPN
- Video only from the VPN
With better audio/video filtering (or less buggy implementations from the VPN) we could also feed audio and video into the VPN.
I'm pretty much convinced this bugginess is due to bad implementations, but maaaaaaaybe these are "security" measures?
Getting data in the VPN
As a proof of concept, I constantly updated the clipboard to get a rough idea of its capabilities:
for i in $(seq 1 1000); do echo -n $i | xclip -i -selection clipboard; sleep 0.125; done
The clipboard misses some events if the delay is less than 125ms, and at these rates packet loss hurts more than slightly higher latency does.
The clipboard seems to easily gobble 100KB. This is a reliable, bit-perfect way of getting data in, so I will not bother trying to use the keyboard/mouse/microphone/webcam feed for now.
Data over Audio
Ideally, I'd be able to get pure analog output from the remote connection, but this is not the case: the audio is compressed fairly heavily, and that compression is not without skips/pops.
The easiest way to encode data into audio is to use minimodem, although the data rates are abysmal. A regular modem going over crappy copper should be able to do 28-56kbps, but with minimodem over RDP (and no "training"!) and 1500ms latency the data rate is FIXME.
Possible improvements:
- Stereo audio for twice the bitrate!
- Figure out why the latency is so bad
- Better encoding scheme (tried amodem but it kept crashing)
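For a feel of what "data over audio" means in practice, below is a toy FSK (frequency-shift keying) modem, the same family of modulation minimodem uses. It is only a sketch: the baud rate and the two tone frequencies are arbitrary values I picked so that each bit contains a whole number of cycles, and there is no framing, clock recovery or tolerance to the skips/pops described above.

```python
import math
from typing import List

SAMPLE_RATE = 48000   # samples per second (assumed sound card rate)
BAUD = 300            # bits per second (arbitrary choice)
FREQ_ZERO = 1200.0    # tone for a 0 bit (arbitrary choice)
FREQ_ONE = 2400.0     # tone for a 1 bit (arbitrary choice)
SAMPLES_PER_BIT = SAMPLE_RATE // BAUD

def modulate(data: bytes) -> List[float]:
    """Turn bytes into audio samples, one tone burst per bit, MSB first."""
    samples = []
    for byte in data:
        for shift in range(7, -1, -1):
            freq = FREQ_ONE if (byte >> shift) & 1 else FREQ_ZERO
            for n in range(SAMPLES_PER_BIT):
                samples.append(math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
    return samples

def tone_energy(window: List[float], freq: float) -> float:
    """Energy of `window` at `freq` (a single-bin DFT)."""
    re = sum(s * math.cos(2 * math.pi * freq * n / SAMPLE_RATE) for n, s in enumerate(window))
    im = sum(s * math.sin(2 * math.pi * freq * n / SAMPLE_RATE) for n, s in enumerate(window))
    return re * re + im * im

def demodulate(samples: List[float]) -> bytes:
    """Decide each bit by comparing the energy at the two tones."""
    bits = []
    for start in range(0, len(samples) - SAMPLES_PER_BIT + 1, SAMPLES_PER_BIT):
        window = samples[start:start + SAMPLES_PER_BIT]
        bits.append(1 if tone_energy(window, FREQ_ONE) > tone_energy(window, FREQ_ZERO) else 0)
    out = bytearray()
    for i in range(0, len(bits) - 7, 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)
```

At 300 baud this tops out at about 37 bytes/s before any overhead, which gives an idea of why audio is the slowest of the channels here.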
Data over video
With the basic function of the VPN being "showing the user his desktop" I can use this to get data out via some type of encoding.
I picked QR codes mostly because there are libraries available; as far as I'm aware, a data matrix is a better way to encode this.
The pyzbar library is fully written in Python, which usually means it is "slow" (at least for this unexpected use-case) -- codes containing ~1KB of payload were taking up to 500ms to render. I gave PyPy a shot and it took the runtime for those big generations down to ~75ms. Still bad, but good enough.
Limitations:
- Framerate is not super high and it clearly skips when there are a lot of changes
- Capture is not synchronized with printing, so out-of-sync capture = packet loss
Possible improvements:
- Color usage? Might get obliterated by H264
- Running the remote desktop at a lower resolution? Might increase refresh rate / lessen frame skips.
- Data matrix instead of a QR code? Couldn't find anything conclusive on "density" of each
Piping data as a network connection
Linux has a very handy feature: user-mode networking via TUN/TAP interfaces.
They are great and super easy to use -- after you run a few ioctl calls you get a file descriptor: whatever the kernel routes to the interface you can read from it, and whatever you write to it is handed to the kernel as if it had arrived from the network!
The main distinction between TUN and TAP (that I know of) is that TAP is a Layer-2 interface, meaning that everything you send is wrapped in an Ethernet frame (so if you want to communicate via IP, you are paying the Ethernet frame overhead as well). Because the interface is L2, it also needs destination MAC addresses, which it should get from an ARP request/reply, but I couldn't get the remote side to reply to the ARP requests it got (either way).
A TUN interface is purely L3 -- it only sends raw/bare IP packets, less overhead, no ARP, great!
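As a rough sketch of the "few ioctl calls" on Linux: open `/dev/net/tun`, then issue `TUNSETIFF` with an `ifreq` carrying the interface name and flags. The interface name `covert0` and the helper names are made up for this example; the `TUNSETIFF` value shown is the one for x86/ARM and differs on a few architectures.

```python
import fcntl
import os
import struct

TUNSETIFF = 0x400454CA  # _IOW('T', 202, int) on x86/ARM; other architectures may differ
IFF_TUN = 0x0001        # L3 device: bare IP packets
IFF_NO_PI = 0x1000      # do not prepend the 4-byte packet-info header

def build_ifreq(name: bytes, flags: int) -> bytes:
    """Pack the part of struct ifreq that TUNSETIFF reads: 16-byte name + short flags.

    Padded to 40 bytes, the size of struct ifreq on 64-bit Linux.
    """
    return struct.pack("16sH22x", name, flags)

def open_tun(name: bytes = b"covert0") -> int:
    """Create/attach a TUN interface and return its file descriptor.

    Needs root (or CAP_NET_ADMIN). `covert0` is a hypothetical name.
    """
    fd = os.open("/dev/net/tun", os.O_RDWR)
    fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name, IFF_TUN | IFF_NO_PI))
    return fd
```

After this, `os.read(fd, 2048)` returns one IP packet per call and `os.write(fd, packet)` injects one. The interface still has to be given an address and brought up (for example with `ip addr add` and `ip link set ... up`) before the kernel routes anything through it.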
The implementation is fairly straightforward:
- Open a TUN network device
- Read data from the input covert channel and write it to the TUN (outgoing data)
- Read data from the TUN and write it to the output covert channel (incoming data)
Specifically, the listed cases are:
Data over audio:
- VPN IN: clipboard -> TUN
- VPN OUT: TUN -> audio card
- Desktop IN: audio card -> TUN
- Desktop OUT: TUN -> clipboard
Data over video:
- VPN IN: clipboard -> TUN
- VPN OUT: TUN -> screen (print QR code)
- Desktop IN: screen (capture) -> TUN
- Desktop OUT: TUN -> clipboard
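Every case listed above boils down to the same loop: read a packet from one side, write it to the other, with one loop per direction. A minimal sketch of that shape follows; the function names and the "return None to stop" convention are my own, and the real read/write callables would wrap the TUN file descriptor, the clipboard, the sound card or the screen capture.

```python
import threading
from typing import Callable, List, Optional

ReadFn = Callable[[], Optional[bytes]]   # returns one packet, or None when the source is done
WriteFn = Callable[[bytes], None]

def pump(read_packet: ReadFn, write_packet: WriteFn) -> None:
    """Copy packets one by one from a source to a sink until the source ends."""
    while True:
        packet = read_packet()
        if packet is None:
            break
        write_packet(packet)

def bridge(tun_read: ReadFn, tun_write: WriteFn,
           channel_read: ReadFn, channel_write: WriteFn) -> List[threading.Thread]:
    """Run both directions at once: covert channel -> TUN and TUN -> covert channel."""
    threads = [
        threading.Thread(target=pump, args=(channel_read, tun_write), daemon=True),
        threading.Thread(target=pump, args=(tun_read, channel_write), daemon=True),
    ]
    for thread in threads:
        thread.start()
    return threads
```

Each side of the connection runs one such bridge; the only thing that changes between the audio and video setups is which callables are plugged in.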
Everything I implemented here was done badly, in a single afternoon. There are no robustness mechanisms, everything is up to the upper layers, even though some basic schemes could improve the usefulness of this dramatically.
There's no acknowledgement of messages received, batching, anything. The only thing that's there is a "sequence number" simply to check for "uniqueness" in the clipboard, as it can't be cleared (no ACKing of messages, remember?!).
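The sequence-number trick can be sketched roughly like this. The `seq:base64` text format and the names are assumptions for illustration, not the exact format I used; the point is only that the receiver polls a clipboard it cannot clear and must ignore a value it has already seen.

```python
import base64
from typing import Optional

def make_frame(seq: int, payload: bytes) -> str:
    """Build the text placed on the clipboard: '<seq>:<base64 payload>' (hypothetical format)."""
    return f"{seq}:{base64.b64encode(payload).decode('ascii')}"

class FrameReceiver:
    """Polls clipboard text and yields each frame's payload only once."""

    def __init__(self) -> None:
        self.last_seq: Optional[int] = None

    def accept(self, clipboard_text: str) -> Optional[bytes]:
        seq_text, sep, body = clipboard_text.partition(":")
        if not sep or not seq_text.isdigit():
            return None  # not one of our frames (e.g. the user copied something)
        seq = int(seq_text)
        if seq == self.last_seq:
            return None  # same frame still sitting on the clipboard
        try:
            payload = base64.b64decode(body, validate=True)
        except ValueError:
            return None  # corrupted or truncated frame
        self.last_seq = seq
        return payload
```

A gap between `last_seq` and the new `seq` would also reveal a dropped frame, which is the first thing an acknowledgement scheme would need.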
Everything in the video is running inside the VPN; the traffic shown is all going to my machine via the covert channels.
CLIP IN: 2.47MiB 0:01:00 = 42KB/s (via nc, so the bottleneck is still QR parsing for ACKs)
QR OUT: 359KiB 0:01:00 = 6KB/s -> parsing QR takes 70-175ms (including capture)
AUDIO OUT: 9.48KiB 0:01:00 = 0.15KB/s (!)
Ping over audio <-> clipboard (be careful with your sound level)
Ping over QR code <-> clipboard
Mosh over QR code <-> clipboard
Things to investigate
With latency being at least 100ms one-way (clipboard) and ~33-100ms the other way (33ms for a frame at 30FPS and an upper threshold of 100ms for the capture), acknowledgement of messages would improve everything, as would aggregation of queued packets (the trivial implementation does not aggregate).
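Aggregation could look something like the sketch below: instead of one packet per clipboard update or QR code, pack as many queued packets as fit, each prefixed with its length. The 2-byte length prefix and the function names are assumptions of mine; none of this exists in the proof of concept.

```python
import struct
from typing import List

def aggregate(queue: List[bytes], max_size: int) -> bytes:
    """Pop packets from the front of `queue` while they fit in `max_size` bytes.

    Each packet is prefixed with a 2-byte big-endian length. Returns an empty
    blob if the first packet alone does not fit.
    """
    blob = bytearray()
    while queue and len(blob) + 2 + len(queue[0]) <= max_size:
        packet = queue.pop(0)
        blob += struct.pack("!H", len(packet)) + packet
    return bytes(blob)

def split(blob: bytes) -> List[bytes]:
    """Recover the individual packets from an aggregated blob."""
    packets = []
    offset = 0
    while offset + 2 <= len(blob):
        (length,) = struct.unpack_from("!H", blob, offset)
        offset += 2
        packets.append(blob[offset:offset + length])
        offset += length
    return packets
```

With a fixed cost of roughly 100ms per clipboard update or per captured frame, filling each transfer closer to its capacity should help throughput more than shaving the per-transfer delay.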