Summary of Transformers Operating Directly on File Bytes

Summary Transformers Operating Directly on File Bytes arxiv.org

7,405 words - PDF document - View PDF document

One Line

The ByteFormer model is a transformer-based neural network that operates directly on file bytes for privacy-preserving inference on obfuscated inputs, achieving high accuracy on various input modalities without requiring modifications or hyperparameter tuning.

Key Points

The Perceiver model includes domain-specific modeling and removes positional embedding for image classification on ImageNet.
ByteFormer is a privacy-preserving model that operates directly on file bytes and achieves high accuracy on various input modalities.
The choice of file encoding affects the accuracy of ByteFormer, and certain augmentations can improve accuracy.
The model uses a transformer backbone with a configuration similar to DeiT-Ti and achieves state-of-the-art accuracy on several datasets.
The method can be used as a building block for obfuscating inputs to a learning system and avoids the need for modality-specific preprocessing.

Summaries

166 word summary

The document discusses the development of models for operating directly on file bytes, with a focus on efficiency and accuracy. The authors compare their approach to previous works and report train time and throughput metrics. They also discuss the importance of characterizing runtime and avoiding hardware inconsistencies. The ByteFormer model is a transformer-based neural network that operates directly on file bytes for privacy-preserving inference on obfuscated inputs. It achieves high accuracy on various input modalities, including images and WAV files, without requiring modifications or hyperparameter tuning. The model maintains privacy by obfuscating input byte values using a permutation function and a custom camera that captures a privacy-preserving representation. The choice of file encoding is important when performing inference on file bytes. The ByteFormer model uses transformers with learned token embeddings and attention to handle long sequence lengths when operating directly on file bytes. The model data format includes flattened tensors in height, width, channel order for image bytes stored in HWC order without any file headers.

485 word summary

ByteFormer is a deep learning model that directly operates on file bytes for classification tasks without decoding them into modality-specific representations. It achieves high accuracy on various input modalities, including images and WAV files, without requiring modifications or hyperparameter tuning. The model maintains privacy by obfuscating input byte values using a permutation function and a custom camera that captures a privacy-preserving representation. The choice of file encoding is important when performing inference on file bytes. The ByteFormer model uses transformers with learned token embeddings and attention to handle long sequence lengths when operating directly on file bytes. The model data format includes flattened tensors in height, width, channel order for image bytes stored in HWC order without any file headers. The ByteFormer model is a transformer-based neural network that operates directly on file bytes for privacy-preserving inference on obfuscated inputs. The model can handle various input modalities without converting them into a standard input representation. The architecture of the model is ByteFormer Tiny (BF-Ti), which is based on the DeiT-Ti architecture with an embedding dimension of 192. The model is used for altering input representations and obfuscating regions of constant color in images without retraining. The optimal values for hyperparameters w and k depend on the type of file being encoded, with w = 128 and k = 32 being the best for audio files. The study compares different attention methods for Transformers operating on file bytes, finding window attention to be more effective than bag attention. The article discusses a new method called ByteFormer, which directly operates on file bytes to create vector representations of files. ByteFormer is a model that operates directly on file bytes, which can be used with image obfuscation techniques to provide security against attackers. The model's accuracy depends on the file encoding chosen, and future work includes adding invariance to file encodings and experimenting with other domains and tasks. The document references various related works and techniques, including mixup, ConvMixer, and secure computation for neural networks. The authors aim to create a model that can directly model file bytes with privacy-preserving applications.

The Perceiver model includes domain-specific modeling at inference time and removes the positional embedding. BF-Ti and DeiT are two models compared to Perceiver for image classification on ImageNet. Perceiver achieves comparable accuracy with fewer flops than BF-Ti and DeiT. Experimental results are reported for camera experiments and multi-modal modeling.

The excerpted text contains a list of references to articles and papers on various topics related to computer vision, machine learning, and cryptography. The references cover topics such as deep learning, vision transformers, object detection, and secure training with GPUs.

Overall, the document discusses the development of models for operating directly on file bytes, with a focus on efficiency and accuracy. The authors compare their approach to previous works and report train time and throughput metrics. They also discuss the importance of characterizing runtime and avoiding hardware inconsistencies.

1370 word summary

The Perceiver model includes domain-specific modeling at inference time and removes the positional embedding. BF-Ti and DeiT are two models compared to Perceiver for image classification on ImageNet. BF-Ti has a small kernel size and low fraction of retained pixels for privacy-preserving purposes. DeiT has a small patch size. Perceiver has a slower forward pass time but achieves comparable accuracy with fewer flops. The model's size and accuracy are similar to DeiT-Ti and DeiT-S, respectively. The model's training time is improved by a relatively small amount through secure featurization. Experimental results are reported for camera experiments and multi-modal modeling. This is a technical document discussing the development of models for operating directly on file bytes, with a focus on efficiency and accuracy. The authors compare their approach to previous works and report train time and throughput metrics. They also discuss the importance of characterizing runtime and avoiding hardware inconsistencies. The document references various related works and techniques, including but not limited to: mixup, ConvMixer, secure computation for neural networks, and feature interactive convolution. The authors aim to create a model that can directly model file bytes with privacy-preserving applications. The excerpted text contains a list of references to articles and papers on various topics related to computer vision, machine learning, and cryptography. The references cover topics such as deep learning, vision transformers, object detection, and secure training with GPUs. Some of the notable references include a paper on the Swin transformer, a hierarchical vision transformer using shifted windows, and an article on the discrete cosine transform. ByteFormer is a model that operates directly on file bytes, consuming only bytes and not explicitly modeling the input modality. It can be used in conjunction with image obfuscation techniques to provide security against an attacker with access to a large set of model inputs. However, the obfuscation method does not provide cryptography-level security. The accuracy of ByteFormer depends on the file encoding chosen, and adding invariance to file encodings is future work. The method has only been evaluated on classification for images and audio, and experimenting with other domains and tasks is future work. The choice of file encoding affects the accuracy of ByteFormer, and using TIFF results in a reduction of accuracy on ImageNet compared to using JPEG. Future work includes exploring fine-grained localization for detection and segmentation tasks. The article discusses a new method called ByteFormer, which directly operates on file bytes to create vector representations of files. The authors visualize learned position and token embeddings and experiment with various augmentations to understand the behavior of ByteFormer. They find that the model is sensitive to locality and byte ordering, and that certain augmentations can improve accuracy. The authors also show results for various file encodings on ImageNet and Speech Commands. The study compares different attention methods for Transformers operating on file bytes, finding window attention to be more effective than bag attention. The experiments are conducted using TIFF and JPEG images. The study also includes an analysis of the privacy-preserving camera setup, which masks pixel channels of ImageNet images at different percentages. The results show that the model is resilient to this transformation, but DeiT is not. The study provides Top-1 ImageNet accuracy of BF-Ti with different types of attention and includes illustrations of the attention methods used. Table 5 summarizes the results for the privacy-preserving camera experiment. The article presents a method for file encoding called ByteFormer Tiny (BF-Ti), which uses byte remapping to obfuscate images and audio files. The method retains shape information and achieves high accuracy in audio classification and image recognition tasks. The optimal values for hyperparameters w and k depend on the type of file being encoded, with w = 128 and k = 32 being the best for audio files. The method is computationally efficient and outperforms previous methods in some cases. Results for different noise levels and JPEG quality factors are presented, and the method is compared to related works. ByteFormer is a model that operates directly on file bytes for image classification tasks. It achieves high accuracy by reducing token lengths and using lower quality factors. The model's kernel size is reduced for JPEG images, which have a smaller token length than TIFF or PNG. ByteFormer outperforms DeiT-Ti accuracies on the ImageNet dataset and is trained with exponential moving average of weights. For Speech Commands V2, ByteFormer is trained with MixUp and other augmentations. ByteFormer is also applied to privacy-preserving camera applications. The ByteFormer model is a transformer-based neural network designed to operate directly on file bytes, allowing for privacy-preserving inference on obfuscated inputs. The model can handle a variety of input modalities without converting them into a standard input representation. To reduce the sequence length, Conv1D is used, and positional embeddings are added to the token embeddings. The model can be used for altering input representations and obfuscating regions of constant color in images without retraining. The architecture of the model is ByteFormer Tiny (BF-Ti), which is based on the DeiT-Ti architecture with an embedding dimension of 192. The kernel size is typically set to 32, and stride is always k/2. The sequence length is reduced with down-sampling layers and shifted window attention. The model can be used against an adversary depending on the threat model used and the alternative encodings available. More sophisticated methods can also be used to improve security. The ByteFormer model uses transformers with learned token embeddings and attention to handle long sequence lengths when operating directly on file bytes. The model uses strided Conv1D and shifted window attention to handle various file encodings, which can exceed 150,000 tokens. At inference time, the model does not require knowledge of the input modality and can be used for image and audio file encodings. The training method includes standard training augmentation and re-encoding files with different file encodings. The model is compared to DeiT-Ti on ImageNet top-1 accuracy using various file encodings. The model data format includes flattened tensors in height, width, channel order for image bytes stored in HWC order without any file headers. The choice of file encoding is important when performing inference on file bytes. For example, an image stored as a JPEG or PNG file will be decoded differently. MP3 encodings can be difficult to handle, but the pydub software package can help. WAV files are stored as audio signals and can be compressed using Huffman encoding. JPEG encodes images by applying a series of transformations to compress the image before serialization. PNG contains headers describing configuration options and the image size. TIFF allows for many custom configurations and can be stored in "CHW" order. The use of image input masking with a single-pixel camera is common in some works. ByteFormer is a privacy-preserving model that performs inference on file bytes without requiring image capture. The model is based on Transformers and can classify images directly from file bytes. ByteFormer achieves strong performance on a variety of image and audio file encodings, with 90% of the pixels masked during training. The model maintains privacy by obfuscating input byte values using a permutation function and a custom camera that captures a privacy-preserving representation. The method can be used as a building block for obfuscating inputs to a learning system and avoids the need for modality-specific preprocessing. ByteFormer is a model that directly operates on file bytes for classification tasks without decoding them into modality-specific representations. It uses a modified Transformer architecture and can handle various input representations, including images and audio files. The model achieves state-of-the-art accuracy on several datasets and can operate on privacy-preserving inputs to protect user privacy. The traditional practice of decoding inputs into modality-specific representations has two main drawbacks, requiring hand-crafted input representations and reducing privacy. ByteFormer is a deep learning model that operates directly on file bytes, avoiding the need to decode files at inference time. It achieves high accuracy on various input modalities, including images and WAV files, without requiring modifications or hyperparameter tuning. ByteFormer also has applications in privacy-preserving inference, as it can perform inference on obfuscated inputs. The model uses a transformer backbone with a configuration similar to DeiT-Ti and achieves an ImageNet Top-1 classification accuracy of 77.33%. The code for ByteFormer will be made available to the public.

Raw indexed text (47,090 chars / 7,405 words / 1,246 lines)

Bytes Are All You Need: Transformers Operating Directly On File Bytes

Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari

Apple

Abstract

Modern deep learning approaches usually transform

inputs into a modality-specific form. For example, the

most common deep learning approach to image classi-

fication involves decoding image file bytes into an RGB

tensor which is passed into a neural network. Instead,

we investigate performing classification directly on file

bytes, without the need for decoding files at inference

time. Using file bytes as model inputs enables the de-

velopment of models which can operate on multiple

input modalities. Our model, ByteFormer, achieves an

ImageNet Top-1 classification accuracy of 77.33% when

training and testing directly on TIFF file bytes using

a transformer backbone with configuration similar to

DeiT-Ti (72.2% accuracy when operating on RGB im-

ages). Without modifications or hyperparameter tuning,

ByteFormer achieves 95.42% classification accuracy when

operating on WAV files from the Speech Commands v2

dataset (compared to state-of-the-art accuracy of 98.7%).

Additionally, we demonstrate that ByteFormer has appli-

cations in privacy-preserving inference. ByteFormer is

capable of performing inference on particular obfuscated

input representations with no loss of accuracy. We also

demonstrate ByteFormer’s ability to perform inference

with a hypothetical privacy-preserving camera which

avoids forming full images by consistently masking 90%

of pixel channels, while still achieving 71.35% accu-

racy on ImageNet. Our code will be made available

https://github.com/apple/ml-cvnets/

tree/main/examples/byteformer.

1. Introduction

Deep learning inference usually involves explicit model-

ing of the input modality. For example, Vision Transform-

ers (ViTs) [7] explicitly model the 2D spatial structure of

images by encoding image patches into vectors. Similarly,

audio inference often involves computing spectral features

(such as MFCCs [25]) to pass into a network [10, 18]. When

a user wants to perform inference on a file stored on disk

(e.g. a JPEG image file or an MP3 audio file), the user must

first decode the file into a modality-specific representation

(e.g. an RGB tensor or MFCCs), as in Figure 1a.

The practice of decoding inputs into a modality-specific

representation has two main drawbacks. First, it requires

hand-crafting an input representation and a model stem for

each input modality. Recent works such as PerceiverIO

[14] and UnifiedIO [24] have shown that Transformer back-

bones can be used for a variety of different tasks. However,

these methods still require modality-specific input prepro-

cessing. For instance, PerceiverIO decodes image files into

[H × W, C] tensors before passing them into the network.

Other modalities input to PerceiverIO are processed into

different forms. We hypothesize that it’s possible to remove

all modality-specific input preprocessing by performing in-

ference directly on file bytes.

The second drawback of decoding inputs into a

modality-specific representation is that it reduces privacy

by exposing the data being analyzed. Consider the case of a

smart-home device that performs inference on RGB images.

If an adversary accesses this model input, the user’s privacy

might be compromised. We hypothesize that inference can

instead be performed on privacy-preserving inputs.

To address these drawbacks, we note that a common

property of many input modalities is that they can be stored

as file bytes. Thus, we use file bytes (without any decoding)

as inputs to our model at inference time (Figure 1b). We

use a modified Transformer [39] architecture for our model,

given their ability to handle a variety of modalities [14, 24]

and variable-length inputs. We call our model ByteFormer.

We demonstrate the efficacy of ByteFormer on Ima-

geNet [6] classification, achieving 77.33% accuracy on files

stored in the TIFF format. Our model uses transformer

backbone hyperparameters chosen in DeiT-Ti [38] (which

achieves 72.2% accuracy on RGB inputs). We also demon-

strate strong results on PNG and JPEG files. Additionally,

we demonstrate that our classification model can achieve

95.8% accuracy on Speech Commands v2 [42], compara-

ble to state-of-the-art (98.7%) [18], without any architec-

ture changes or hyperparameter tuning.

Because ByteFormer can handle a variety of input rep-

resentations, we can also use it to operate on privacy-

preserving inputs. We demonstrate that we can remap in-A

File Decoding

0xFF 0x01

...

Patch Embedding

...

0x8A

File Bytes

RGB Tensor

Tokenized Representation

Token Embedding

0xFF 0x01

...

0x8A

File Bytes

...

Tokenized Representation

Obfuscation ϕ

0xFF 0x01

0x8A

File Bytes

DeiT

Token Embedding

0x1C 0x78

...

Obfuscated File Bytes

Privacy-Preserving Capture

0xAF 0x29

Privacy-Preserving Camera

...

0xB2

Tokenized Representation

Token Embedding

...

0x31

Privacy-Preserving Representation

...

Tokenized Representation

Figure 1. An overview of our ByteFormer (BF) compared to standard inference with DeiT [38]. (A): File bytes are read from disk and

converted to an RGB tensor using a standard image decoder. A patch embedding creates tokens from the RGB representation. (B): File

bytes on disk are directly used as tokens, and projected into learned embeddings. (C): Similar to (B), but we apply an obfuscation function

ϕ. (D): We capture a privacy-preserving representation with a custom camera, and perform token embedding from this representation.

put byte values using a permutation function ϕ : [0, 255] →

[0, 255] (Figure 1c) to obfuscate inputs without losing accu-

racy. Although this does not guarantee cryptography-level

security, we demonstrate how this method can be used as a

building block for obfuscating inputs to a learning system.

Stronger privacy can be obtained by performing infer-

ence with ByteFormer on a partially-formed image (Fig-

ure 1d). We demonstrate that ByteFormer is capable of

training on images with 90% of the pixels masked while still

achieving 71.35% accuracy on ImageNet [6]. ByteFormer

does not require information about the specific location of

unmasked pixels. Our representation passed to our model

maintains privacy by avoiding a standard image capture.

In summary, our contributions are: (1) we develop Byte-

Former, a model capable of performing inference on file

bytes. (2) We show that ByteFormer achieves strong per-

formance on a variety of image and audio file encod-

ings, without the need for architectural changes or hy-

perparameter tuning. (3) We demonstrate application of

ByteFormer to privacy-preserving inputs. (4) We ana-

lyze properties of ByteFormers trained to classify im-

ages and audio directly from file bytes. (5) We will

release our code at https://github.com/apple/

ml-cvnets/tree/main/examples/byteformer.

2. Related Work

Architectures With Multimodal Inputs: A few meth-

ods have explored the idea of feeding different input modal-

ities into the same network for processing. Perceiver IO

[14] demonstrates that a Transformer [39] architecture with

cross-attention input can be used for a variety of different

tasks. Their method differs from ours because their inputsare processed with modality-specific preprocessing. For ex-

ample, images are loaded into a [H × W, C] buffer, which

differs from the model’s treatment of text. By contrast, our

method can classify images directly from file bytes. To our

knowledge, we are the first to explore models that directly

consume file bytes without modality-specific processing.

Other recent works that model multiple modalities with

a single model or a single embedding space (but still require

input-specific processing) include Unified IO [24], CLIP

[33], and LQAE [21].

Alternate Image Input Representations: Previous

works have explored using alternate input representations

for images. In [11], the authors perform partial JPEG de-

coding, stopping when Discrete Cosine Transform [26] co-

efficients are formed. They modify CNNs [12] to ingest

this new representation. In [31], a similar method is used

with Transformers. Our work differs in that we perform no

decoding of file bytes at inference time.

Privacy-Preserving Inference: We demonstrate appli-

cations of our model to privacy-preserving inference. A few

works have examined secure inference in Multi-Party Com-

munication (MPC) settings [41, 19, 35, 37, 16]. The focus

of these works is to perform inference securely on a remote

machine using decrypted data on a local machine. We differ

from these methods in that our privacy-preserving systems

add a layer of privacy to the data on a local machine, and

inference is performed locally. Thus, our approach comple-

mentary to theirs.

Compressive Sensing: Our privacy-preserving camera

is inspired by works in compressive sensing [8]. Related

works use image input masking with a single-pixel camera

to capture an image over multiple exposures with different

masks [9, 13]. Instead, we experiment with a single masked

capture on a multi-pixel camera. It is not common to store images in this way, since they

cannot be decoded without pre-existing knowledge of their

height and width. This serves as a strong baseline.

fCHW: This format is similar to fHWC, but images are

stored in “CHW” order.

TIFF: The TIFF file encoding [32] allows for many cus-

tom configurations. For our experimentation, we use the de-

fault settings provided by PIL. This results in a format sim-

ilar to fHWC, but with the addition of TIFF image headers

describing configuration options and the image size.

PNG: The PNG format [2] contains headers describing

PNG configuration options, followed by rows of image data

stored in “IDAT” chunks. Each IDAT chunk contains a byte

describing the filtering method used for that row’s data. The

filtering method applies an offset to the row’s data based on

neighboring pixel values. Thus, our PNG file contains rows

of RGB data, with offsets applied, interrupted by occasional

bytes that contain file encoding settings. We do not use the

optional zlib compression that PNG allows.

JPEG: JPEG [43] encodes images by applying a series

of transformations to compress the image before serializa-

tion. The RGB image is converted to YCbCr, then down-

sampled in the chroma channels, then passed through a Dis-

crete Cosine Transform [26], then quantized using coeffi-

cients determined by the JPEG quality factor. The quality

factor determines the level of compression, with 100 denot-

ing no compression due to quantization, and lower values

indicating stronger compression. After quantization, the co-

efficients are encoded via a run-length encoding, followed

by a Huffman encoding [34]. Note that Huffman codes are

not byte-aligned, e.g. they can cross byte boundaries. We

expect this to make our modeling task more difficult.

3. Overview of Common File Encodings WAV: The WAV file encoding [17] stores audio signals

represented as a sequence of amplitudes. We use single-

channel (mono) audio files. The most common configu-

ration options are the bit depth and the frequency. The

bit depth corresponds to the precision with which ampli-

tude values are stored. We experiment with a variety of bit

depths, storing audio with 8-bit unsigned integers, 16-bit in-

tegers, 32-bit integers, and 32-bit floating-point values. The

frequency corresponds to how often amplitude values are

chosen. We use 16 kHz, a standard choice for audio [42].

MP3: MP3 [30] uses a perceptual compression method

that removes portions of audio that are difficult for humans

to detect. The remaining portions are recorded in frequency

space. An MP3 file contains a series of frames. Each frame

contains a header with encoding settings, followed by the

encoded signal in frequency space. We use standard settings

for MP3 provided by the pydub [36] software package. We

expect MP3 encodings to be more difficult to handle than

WAV files due to the compression applied.

When performing inference with a standard model, the

choice of file encoding is irrelevant. For example, it doesn’t

matter whether an image is stored as a JPEG or PNG file if

it will be decoded into an RGB tensor. By contrast, Byte-

Former performs inference on file bytes. In this case, the

choice of file encoding matters. This section provides an

overview of common file encodings for images (subsec-

tion 3.1) and audio (subsection 3.2). File encoding methods

typically contain a large number of optional settings that in-

fluence the resulting file bytes. We use default settings pro-

vided by PIL [4] or scipy [40] software packages unless

otherwise stated.

3.1. Image File Encodings

fHWC: We use fHWC as an abbreviation for “flattened

tensors in height, width, channel order.” It refers to uint8

image bytes stored in HWC order without any file headers.

3.2. Audio File EncodingsModel Data format E[S] E[L t ] Top-1

DeiT-Ti

DeiT-Ti ⋆ RGB Tensor

RGB Tensor 3 × 224 × 224

3 × 224 × 224 196

196 72.2

74.35

fHWC

fCHW

TIFF

PNG

JPEG 150528

150528

150668

150864

48564 9407

9407

9415

9428

12140 77.06

74.65

77.33

74.94

65.92

Transformer

BF-Ti (Ours)

Conv1D

Token Embedding

Table 1. ImageNet Top-1 accuracy of ByteFormer Tiny (BF-Ti)

using various file encodings, compared to DeiT-Ti. E[S] denotes

the input shape, and E[L t ] denotes the token length passed to the

transformer backbone. ( ⋆ ) denotes our implementation of DeiT-Ti.

We set BF-Ti’s Conv1D kernel size to k = 32 for all experiments

except JPEG (k = 8).

of the validation images and save them in the TIFF format.

Such preprocessing is only necessary because the ImageNet

validation set is not already stored in the desired format.

0xFF 0xAC 0x17 0xFA 0x28 0x01 0xB2 0x87

Figure 2. An overview of ByteFormer. We map byte values to

learned vectors using a learned token embedding. Next, we ap-

ply a Conv1D to reduce the token dimension. Finally, we apply a

transformer with shifted window attention and downsampling.

4. Methods

First, we discuss our method for performing inference

on file bytes (subsection 4.1). Then, we discuss how to use

our method with image obfuscation techniques to enable

privacy-preserving inference (subsection 4.2). Finally, we

discuss how to use our method with a privacy-preserving

camera to perform inference without constructing full im-

ages (subsection 4.3).

4.1. Inference on File Bytes

4.1.1

Preprocessing

Some of our file encodings such as TIFF are not frequently

used in machine learning datasets. To allow for compar-

isons on a single dataset across a variety of file encodings,

we must re-encode files with different file encodings.

At training time, we decode the file (e.g. read the con-

tents into an RGB tensor in the case of images), then per-

form standard training augmentation (e.g. random cropping

in the case of images), then save the result in the desired file

encoding. We find that standard training augmentation is

important for model accuracy. Thus, our training method is

implicitly dependent on the input modality due to our aug-

mentation.

At inference time, we do not need knowledge of the in-

put modality. We only need to ensure that our model inputs

use the correct file encoding. For example, for TIFF ex-

periments on ImageNet, we precompute 224 × 224 crops

4.1.2

ByteFormer

We describe our ByteFormer model for inference on file

bytes. An overview of our model is given in Figure 2. The

main challenge in using file bytes as inputs is the long se-

quence lengths. In Table 1, we observe that input sizes E[S]

for various file encodings can exceed 150, 000 tokens. As

described below, we use strided Conv1D and shifted win-

dow attention [22] to handle long sequence lengths.

The first step of our model is to use a learned token em-

bedding with a vocabulary size of 256 (corresponding to 2 8

unique byte values) to produce embeddings. This choice

allows our model to handle a variety of input modalities.

The next step of our model is to perform a Conv1D

to reduce the sequence length. Our intuition for choosing

Conv1D is that neighboring file bytes often contain related

information. Reducing our sequence length with Conv1D

greatly improves memory usage. In Table 1, E[L t ] denotes

the input size to our Transformer, which is significantly

smaller than E[S]. Typically, we set our kernel size k = 32.

Our stride is always k/2.

Next, we add positional embeddings to the token em-

beddings, then pass our embeddings to a Transformer. We

choose Transformer size parameters to match the 12-layer

DeiT-Ti [38] architecture with embedding dimension 192.

We call this particular version of our architecture Byte-

Former Tiny (BF-Ti). To compensate for our long sequence

length (9417 for TIFF, compared to 196 for DeiT-Ti), we

use shifted window attention [22] to limit the attention win-

dow size w, alleviating the quadratic complexity of atten-

tion layers on sequence length. We also add down-sampling

layers to halve the sequence length, as in [22]. We add them

after transformer blocks 0, 1, 3, 5, 7, and 9. After passing

our tokens through the transformer, we average the embed-dings across the sequence dimension.

4.2. Inference on Obfuscated Inputs

ByteFormer is designed to perform inference on file en-

codings without converting them into a standard input rep-

resentation (e.g. an RGB tensor in the case of images).

Therefore, we explore whether ByteFormer can be used for

inference on privacy-preserving representations that obfus-

cate information about the underlying data (Figure 1c).

Consider a permutation ϕ : {0, 1, 2, . . . , 255} →

{0, 1, 2, . . . , 255}. Let τ denote a token embedding, and let

f θ denote the subsequent transformer. It’s easy to see that,

for a given ϕ, there exists a τ ϕ −1 such that τ ϕ −1 (ϕ(x)) =

τ (x). τ ϕ −1 is simply a copy of τ with embedding vec-

tors reassigned to different indices. Thus, f (τ ϕ −1 (ϕ(x))) =

f (τ (x)). The implication of this statement is that our net-

work f θ can operate on re-encoded inputs ϕ(x) without re-

quiring any retraining as long as the network’s token em-

bedding τ is reordered to τ ϕ −1 .

To take advantage of this property, we choose a permuta-

tion ϕ at random before training. All training and inference

inputs are remapped using ϕ. We optionally apply uniform

noise before applying ϕ. Without uniform noise, ϕ can be

applied to a standard ByteFormer without retraining (as ex-

plained above). However, we find uniform noise helpful in

obfuscating regions of constant color in our experiments.

More generally, we can use more sophisticated methods

for altering input representations. As our method can handle

highly nonlinear JPEG encodings, we expect it to perform

well on a variety of alternative encodings that an outside

observer might not be able to easily guess. How secure are

such methods against an adversary? This analysis depends

on the threat model used. For example, if an adversary has

access to a large number of encoded samples ϕ(x), analy-

sis of byte statistics might suggest that strings of common

bits correspond to patches of blue sky in images. The ad-

versary’s task is made more difficult given certain file en-

codings (e.g. the highly nonlinear JPEG encoding). We

do not make strong claims regarding the level of security

provided by different choices of ϕ. Secure systems should

be designed and analyzed by security researchers. Instead,

we simply suggest that decoupling the input representation

from the model can lead to new possibilities for building

more secure systems.

4.3. Privacy-Preserving Camera

We describe another application of ByteFormer to

privacy-preserving inference (Figure 1d). In this scenario, a

custom camera captures a non-standard, privacy-preserving

representation to allow for inference without building a full

RGB image. This custom representation could take a vari-

ety of forms. In our experimentation, we consider a hypo-

thetical camera that masks out a large fraction of its pixel

channels. The camera stores the remaining unmasked pixel

channels in an array without retaining the coordinates of

pixel channels on the image sensor. In this scenario, an ad-

versary could not obtain a faithful reconstruction of the in-

put image. Even if the adversary could guess pixel channel

locations, the low resolution of captured data prevents the

adversary from recovering a high-fidelity image.

5. Experiments

We evaluate ByteFormer on 1000-way classification on

ImageNet [6]. We also evaluate 12-way audio keyword

classification (including “background” and “unknown”

classes) of 1-second audio clips sampled at 16 khz using

Speech Commands v2 [42]. For all experiments, Byte-

Former’s backbone uses hyperparameters that match DeiT-

Ti [38]. We refer to this architecture as BF-Ti.

We train using CVNets [27]. For ImageNet, we use batch

size 48 on 8 NVIDIA A100 GPU machines. At training

time, we use random resized cropping, random horizontal

flipping, RandAugment [5], and RandomErase [45] before

storing the image in the desired file encoding (subsubsec-

tion 4.1.1). We train with AdamW [23] with weight decay

0.05, and a cosine annealing learning rate schedule from

0.001 to 0.00002, with 7500 warmup iterations.

We train our Speech Commands v2 with MixUp [44],

noise augmentation, and time shifting augmentation, as in

[29]. Our training and architecture hyperparameters match

our ImageNet experiments. We train these models on 4

NVidia A100 GPU machines.

For ImageNet experiments, we report Top-1 accuracy of

models trained with exponential moving average of weights

with momentum 0.0001, which typically increased accu-

racy by roughly 0.25%. For Speech Commands V2, we

found EMA to sometimes increase and sometimes decrease

accuracy, so we omit it.

5.1. ImageNet File Encodings

Table 1 summarizes results for a variety of file encodings

on the ImageNet dataset. For BF-Ti, we use w = 128 and

k = 32 for all models except JPEG, for which we find k = 8

to perform better. Our method surpasses DeiT-Ti accuracies

for TIFF, PNG, fCHW, and fHWC encodings.

We find training on JPEG to be more difficult. This is

likely due to the highly nonlinear and variable-length JPEG

encoding. We investigate the influence of our model’s ker-

nel size k on JPEG accuracy in Table 2. We find that re-

ducing k from its default value of 32 increases accuracy.

Since JPEG images have a smaller token length than TIFF

or PNG, they are likely less compressible. To further ex-

plore this, we investigate two settings for JPEG quality fac-

tor in Table 2. We find that lower quality factors result in

lower token lengths, thus reducing k improves accuracy. We

also try reducing w, but accuracy does not improve.k E[S] Top-1

100

100 128

128

128 32

8 48564

48564

48564 60.86

64.86

65.92

60 128

128

128 32

4 8436

8436

8436 31.8

50.11

56.26

62.52

60 32

32 32

4 8436

8436

8436 37.23

50.24

56.74

59.52

Noise level

None

U[−5, 5]

U[−10, 10]

U[−20, 20]

None

Input w k E[S] Top-1

BC-ResNet-8 log Mel - - 40 × 98 98.70

BF-Ti (Ours) W-FP32 128

128 32

16 64058

64058 95.80

95.51

BF-Ti (Ours) W-INT32 128

128 32

16 64044

64044 94.90

95.27

BF-Ti (Ours) W-INT16 128

128

128 32

8 32044

32044

32044 94.81

95.51

95.13

W-UINT8 128

128

128 32

4 16044

16044

16044 92.28

94.39

94.81

93.99

BF-Ti (Ours)

DeiT-Ti BF-Ti

51.61

50.77

49.50

43.84 77.39

77.27

77.17

76.31

Table 4. ImageNet Top-1 results for obfuscation with ϕ. We show

results with no noise, and with uniform noise in [−a, a] added. We

use the fHWC encoding.

Table 2. ImageNet Top-1 accuracy for ByteFormer Tiny (BF-Ti)

for different JPEG quality factors q, window sizes w, and convo-

lutional kernel sizes k. E[S] denotes the expected shape of the

inputs during validation.

Model

128 8

3465

88.39

128 4

3465

88.00

BF-Ti (Ours)

MP3

3465

88.69

3465

89.19

Table 3. Results for audio classification with BF-Ti on the Speech

Commands v2 dataset. “W-” denotes WAV files with the given bit

width. E[S] denotes the shape of network inputs.

We present our method’s computational efficiency com-

pared to related works in Appendix A.

5.2. Speech Commands v2 File Encodings

Results for audio classification on Speech Commands

v2 [42] are given in Table 3. BF-Ti achieves accuracies

of up to 95.51% on WAV files, comparable to the state-of-

the-art method BC-ResNet-8 [18]. Note that BC-ResNet-8

is specifically designed for audio processing. By contrast,

we performed no parameter tuning relative to our ImageNet

training recipe (besides ablating choices of w and k). Our

best-performing model has w = 128 and k = 32. Our

U[-5, 5]

U[-10, 10]

U[-20, 20]

obfuscated

Figure 3. A sample image from the ImageNet validation set, with

uniform noise applied (top row), and with byte remapping ϕ addi-

tionally applied (bottom row).

model performs best on floating-point values. In this case,

since each 32-bit floating-point value in the audio signal

will be encoded as 4 file bytes, each audio sample will be

represented by 4 neighboring tokens before our Conv1D.

We investigate the influence of k on model accuracy. In

general, the optimal k decreases when the expected number

of input tokens decreases. This matches our observations in

ImageNet JPEG experiments. For MP3 files, we observed

that k = 32 resulted in unstable models due to the drastic

reduction in token length. For MP3, we additionally exper-

iment with w = 32, but it does not improve results.

5.3. Image Obfuscation

Results for our image obfuscation method (subsec-

tion 4.2) on ImageNet are summarized in Table 4. After

obtaining our fHWC encoding, we apply a randomly cho-

sen obfuscation function ϕ.

Examples of obfuscated images are shown in Figure 3.

We observe that byte remapping retains shape information.

A region of the image that is dominated by a single pixel

value will continue to be dominated by a new (remapped)

pixel value. To alleviate this, we add noise from a uniform

distribution U[−a, a] sampled from −a to a (inclusive) to

each pixel channel independently, then compute the result

modulo 256. Afterwards, we apply ϕ. This prevents re-

gions of constant pixel value from being remapped to a sin-

gle value. As shown in Figure 3, the upper right corner ofKept

Top-1

75%

74.77

50%

75.36

25%

74.04

10%

71.35

68.11

64.15

Table 5. ImageNet Top-1 accuracy for our privacy-preserving

camera experiment with BF-Ti when the given fraction of pixel

channels are kept.

100%

25%

10%

5.4. Privacy Preserving Camera

Table 5 summarizes our results for our privacy-

preserving camera (subsection 4.3). We emulate the cam-

era setup by masking pixel channels of ImageNet images at

random, then storing unmasked pixels in a buffer (in fHWC

order) and passing that buffer into our network. For these

experiments, we cannot provide DeiT-Ti baselines because

DeiT-Ti is not capable of ingesting pixel values without any

indication of their placement in the image.

In Figure 4, we show masked inputs before the unmasked

pixels are rasterized. At 10% pixel retention, the content

of image is hard to visually perceive even though active

pixels are placed side-by-side in a new buffer. Even if

an adversary correctly guessed the positions of unmasked

pixel channels in the original image, the adversary could

not former a high-fidelity image. As shown in Table 5,

our accuracy at 10% pixel retention is 71.35%, compara-

ble to the original DeiT-Ti model operating on non-privacy-

preserving (unmasked) images.

Note that this privacy-preserving technique can be com-

bined with the byte remapping technique (subsection 4.2) to

further obfuscate network inputs.

6. Analysis

Alternate Attention Methods: We study three state-of-

the-art self-attention methods for handling long sequence

lengths in Transformers: (1) full self-attention [39, 7, 20]

where each token attends on every other token, (2) shifted

Full

Window

Bag OOM

77.33

75.20

0 0

1 1

2 2

3 3

Full Attention

0 2

1 3

4 6

5 7

Window Attention

the image becomes less recognizable as noise from progres-

sively larger ranges is used. In Table 4, we observe that our

method is resilient to this transformation, but DeiT is not.

Top-1

Table 6. ImageNet Top-1 accuracy of BF-Ti with different types

of attention. We run out of memory with full attention.

Figure 4. An ImageNet validation image captured by our hy-

pothetical privacy-preserving camera in which the given fraction

of pixel channels are kept. Note that positions of retained pixel

channels is discarded by the camera. To make visualization pos-

sible, we include the positional information implicitly by placing

unmasked pixels in the correct position.

Attention

Bag Attention

Figure 5. Illustration of the types of attention we experiment with.

Bag attention is computed in two stages. First, individual bags

compute attention. Then, attention is computed across bags.

Augmentation Top-1

Random Shuffle

Stride

Window Shuffle

Cyclic

Reverse

Baseline 3.06

5.64

18.14

60.97

61.23

60.81

Table 7. Ablation showing the Top-1 ImageNet accuracy of BF-Ti

on JPEG images (k = 32, quality factor 100). See text for details.

window attention [22, 1] where tokens are divided into local

windows and each local window computes self-attention in-

dependently, and (3) bag (or hierarchical) attention [28, 3]

where tokens are broken up into bags and each bag com-

putes intra-bag self-attention. Inter-bag self-attention is

then computed on the resultant output. These different

methods are visualized in Figure 5 while results on TIFF

data are summarized in Table 6. We choose TIFF for these

experiments because of its long sequence length (Table 1).

We find window attention to outperform bag attention. Note

that full attention cannot be run due to its O(n 2 ) depen-

dence on sequence length n. In our main experiments, weIN TIFF

IN fCHW

IN PNG

IN JPEG

SC FP32

SC UINT8

0.8

128

0.6

1.0

0.4

128

0.2

128

0.0

Figure 6. |x · y|/(||x|| · ||y||) for pairs x, y of token embeddings (top row) and positional embeddings (bottom row) learned by BF-Ti. We

show results for various file encodings on ImageNet (IN) and Speech Commands (SC).

used shifted window attention.

Effect of Byte Ordering: To better understand Byte-

Former’s behavior, we ask, does ByteFormer simply learn

byte frequencies, or is the byte ordering relevant? In Ta-

ble 7, we apply a series of augmentations during training

and validation. We focus on the case of JPEG compression

at quality factor 100 with our standard kernel size k = 32.

Each augmentation modifies the byte order of the inputs

in some way. In random shuffle, we randomly re-

order the bytes during training and validation. The order

is redrawn every iteration. This severely degrades accuracy.

Next, we perform a strided sampling with stride size 1024

(e.g. [0, 1024, 2048, . . . , 1, 1025, 2049, . . .]). This slightly

improves accuracy over the previous method by improving

byte order consistency. Next, we experiment with window

shuffle, in which the bytes from each window of size

1024 are consistently permuted. This increases accuracy

to 18.14%. Next we experiment with a cyclic shift in

which the second half of the image bytes are moved to

the beginning. Accuracy matches the baseline (unaltered

JPEG bytes) closely. Similarly, reverse, in which the

byte order is reversed, preserves locality well and matches

the baseline. We find that our model is sensitive to locality,

and does not only learn byte frequencies.

Learned Token Embeddings: We study the token em-

beddings learned by ByteFormer. These embeddings are

used to project file bytes into vector representations. In Fig-

ure 6 (top row), we observe the absolute value of the cosine

distance |x · y|/(||x|| · ||y||) between each pair of token em-

beddings x, y on a variety of file encodings. We choose this

metric to highlight the difference between (anti-)correlated

embeddings (bright patches) and uncorrelated embeddings

(dark patches). The pattern varies substantially across in-

put encodings and tasks. In TIFF, PNG, and fCHW, we

observe a bright band off of the diagonal, corresponding to

high correlation between bytes and their neighbors. This

matches our expectations, since replacing a byte with its

neighbor only slightly alters the image. This does not hold

for JPEG due to the Huffman encoding step. We also ob-

serve that the correlation between token embeddings in the

float32 encoding of Speech Commands is generally weak.

We believe this occurs because the float32 audio amplitude

value is split across four bytes in the file encoding, weaken-

ing the association between byte values and amplitudes.

Learned position embeddings We visualize the absolute

value of the cosine distance between the first 256 posi-

tional embeddings learned by ByteFormer in Figure 6 (bot-

tom row). For JPEG, we see a strong band of highly un-

correlated values at early positions, corresponding to the

file header. Later positions demonstrate interesting pat-

terns that may arise due to the Huffman encodings crossing

byte boundaries. In TIFF, a small band of highly uncorre-

lated values is visible early on, corresponding to the header

(which is shorter than in the JPEG case).

7. Limitations

The accuracy of ByteFormer depends on the file en-

coding chosen. As shown in section 5, choosing JPEG

over TIFF results in a reduction of accuracy on ImageNet.

Adding invariance to file encodings is future work.

As discussed in subsection 4.2, our choice of ϕ for our

obfuscation method does not provide cryptography-level se-

curity against an attacker with access to a large set of model

inputs. We view this method as a building block for security

experts to design thoroughly analyzed, secure systems.

Finally, our method has only been evaluated on classi-

fication for images and audio. Experimenting with other

domains (video, text) and tasks that require fine-grained lo-

calization (detection, segmentation) is exciting future work.8. Conclusion

We present ByteFormer, a model that consumes only

bytes and does not explicitly model the input modality. We

show that it achieves strong performance on image and au-

dio classification without hyperparameter tuning or archi-

tecture modifications. We show how ByteFormer can be

used in conjunction with image obfuscation techniques with

little or no loss in accuracy. We also demonstrate how Byte-

Former can be incorporated into a privacy-preserving cam-

era to enable inference without forming a full image at cap-

ture time.

References

[1] Iz Beltagy, Matthew E Peters, and Arman Cohan. Long-

former: The long-document transformer. arXiv preprint

arXiv:2004.05150, 2020.

[2] T. Boutell.

Png (portable network graphics) spec-

ification.

https://www.rfc-editor.org/rfc/

rfc2083. Accessed: 2023-03-05.

[3] Richard J Chen, Chengkuan Chen, Yicong Li, Tiffany Y

Chen, Andrew D Trister, Rahul G Krishnan, and Faisal

Mahmood. Scaling vision transformers to gigapixel images

via hierarchical self-supervised learning. In Proceedings of

the IEEE/CVF Conference on Computer Vision and Pattern

Recognition, pages 16144–16155, 2022.

[4] Alex Clark. Pillow (pil fork) documentation, 2015.

[5] Ekin Dogus Cubuk, Barret Zoph, Jonathon Shlens, and

Quoc V. Le. Randaugment: Practical automated data aug-

mentation with a reduced search space. 2020 IEEE/CVF

Conference on Computer Vision and Pattern Recognition

Workshops (CVPRW), pages 3008–3017, 2019.

[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, K. Li, and

Li Fei-Fei. Imagenet: A large-scale hierarchical image

database. 2009 IEEE Conference on Computer Vision and

Pattern Recognition, pages 248–255, 2009.

[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,

Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,

Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-

vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is

worth 16x16 words: Transformers for image recognition at

scale. ArXiv, abs/2010.11929, 2020.

[8] Simon Foucart and Holger Rauhut. A Mathematical Intro-

duction to Compressive Sensing. Springer, 2013.

[9] Graham M. Gibson, Steven D. Johnson, and Miles J. Padgett.

Single-pixel imaging 12 years on: a review. Opt. Express,

28(19):28190–28208, Sep 2020.

[10] Yuan Gong, Yu-An Chung, and James R. Glass. Ast: Audio

spectrogram transformer. ArXiv, abs/2104.01778, 2021.

[11] Lionel Gueguen, Alex Sergeev, Ben Kadlec, Rosanne Liu,

and Jason Yosinski. Faster neural networks straight from

jpeg. In S. Bengio, H. Wallach, H. Larochelle, K. Grau-

man, N. Cesa-Bianchi, and R. Garnett, editors, Advances in

Neural Information Processing Systems, volume 31. Curran

Associates, Inc., 2018.

[12] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep

residual learning for image recognition. 2016 IEEE Confer-

ence on Computer Vision and Pattern Recognition (CVPR),

pages 770–778, 2015.

[13] Catherine Higham, Roderick Murray-Smith, Miles Padgett,

and Matthew Edgar. Deep learning for real-time single-pixel

video. Scientific Reports, 8, 02 2018.

[14] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac,

Carl Doersch, Catalin Ionescu, David Ding, Skanda Kop-

pula, Andrew Brock, Evan Shelhamer, Olivier J. H’enaff,

Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals,

and João Carreira. Perceiver io: A general architecture for

structured inputs & outputs. ArXiv, abs/2107.14795, 2021.

[15] Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zis-

serman, Oriol Vinyals, and João Carreira. Perceiver: General

perception with iterative attention. In International Confer-

ence on Machine Learning, 2021.

[16] Neha Jawalkar, Kanav Gupta, Arkaprava Basu, Nishanth

Chandran, Divya Gupta, and Rahul Sharma. Orca: Fss-based

secure training with gpus. Cryptology ePrint Archive, Paper

2023/206, 2023. https://eprint.iacr.org/2023/

206.

[17] Peter Kabal.

Wave file specifications.

https:

//www.mmsp.ece.mcgill.ca/Documents/

AudioFormats/WAVE/WAVE.html.

Accessed:

2023-03-05.

[18] Byeonggeun Kim, Simyung Chang, Jinkyu Lee, and Dooy-

ong Sung. Broadcasted residual learning for efficient key-

word spotting. In Interspeech, 2021.

[19] Nishant Kumar, Mayank Rathee, Nishanth Chandran, Divya

Gupta, Aseem Rastogi, and Rahul Sharma. Cryptflow: Se-

cure tensorflow inference. 2020 IEEE Symposium on Secu-

rity and Privacy (SP), pages 336–353, 2019.

[20] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He.

Exploring plain vision transformer backbones for object de-

tection. In Computer Vision–ECCV 2022: 17th European

Conference, Tel Aviv, Israel, October 23–27, 2022, Proceed-

ings, Part IX, pages 280–296. Springer, 2022.

[21] Hao Liu, Wilson Yan, and Pieter Abbeel. Language quan-

tized autoencoders: Towards unsupervised text-image align-

ment, 2023.

[22] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng

Zhang, Stephen Lin, and Baining Guo. Swin transformer:

Hierarchical vision transformer using shifted windows. 2021

IEEE/CVF International Conference on Computer Vision

(ICCV), pages 9992–10002, 2021.

[23] Ilya Loshchilov and Frank Hutter. Decoupled weight de-

cay regularization. In International Conference on Learning

Representations, 2017.

[24] Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mot-

taghi, and Aniruddha Kembhavi. Unified-io: A unified

model for vision, language, and multi-modal tasks. ArXiv,

abs/2206.08916, 2022.

[25] James

Lyons.

Mel

frequency

cep-

stral

coefficient

(mfcc)

tutorial.

http://practicalcryptography.com/miscellaneous/machine-

learning/guide-mel-frequency-cepstral-coefficients-mfccs/.

Accessed: 2023-03-06.Burovski, Pearu Peterson, Warren Weckesser, Jonathan

[26] Dave Marshall.

The discrete cosine transform.

Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wil-

https://users.cs.cf.ac.uk/Dave.Marshall/Multimedia/node231.html.

son, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J.

Accessed: 2023-03-06.

Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey,

[27] Sachin Mehta, Farzad Abdolhosseini, and Mohammad

İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, De-

Rastegari. Cvnets: High performance library for computer

nis Laxalde, Josef Perktold, Robert Cimrman, Ian Henrik-

vision. Proceedings of the 30th ACM International Confer-

sen, E. A. Quintero, Charles R. Harris, Anne M. Archibald,

ence on Multimedia, 2022.

Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt,

[28] Sachin Mehta, Ximing Lu, Donald L. Weaver, Joann G. El-

and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algo-

more, Hannaneh Hajishirzi, and Linda G. Shapiro. Hat-

rithms for Scientific Computing in Python. Nature Methods,

net: An end-to-end holistic attention network for diagnosis

17:261–272, 2020.

of breast biopsy images. ArXiv, abs/2007.13007, 2020.

[41] Sameer Wagh, Divya Gupta, and Nishanth Chandran. Se-

[29] Dianwen Ng, Yunqi Chen, Biao Tian, Qiang Fu, and

curenn: 3-party secure computation for neural network

Chng Eng Siong. Convmixer: Feature interactive convolu-

training. Proceedings on Privacy Enhancing Technologies,

tion with curriculum learning for small footprint and noisy

2019:26 – 49, 2019.

far-field keyword spotting. ICASSP 2022 - 2022 IEEE Inter-

national Conference on Acoustics, Speech and Signal Pro-

[42] Pete Warden. Speech commands: A dataset for limited-

cessing (ICASSP), pages 3603–3607, 2022.

vocabulary speech recognition. ArXiv, abs/1804.03209,

2018.

[30] M. Nilsson. The audio/mpeg media type. https://www.

rfc-editor.org/rfc/rfc3003.html. Accessed:

[43] Wikipedia. Jpeg. https://en.wikipedia.org/

2023-03-05.

wiki/JPEG. Accessed: 2023-03-05.

[31] Jeongsoon Park and Justin Johnson.

Rgb no more:

[44] Hongyi Zhang, Moustapha Cissé, Yann Dauphin, and David

Minimally-decoded jpeg vision transformers.

ArXiv,

Lopez-Paz. mixup: Beyond empirical risk minimization.

abs/2211.16421, 2022.

ArXiv, abs/1710.09412, 2017.

[32] G. Parsons and J. Rafferty.

Tag image file for-

[45] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and

mat (tiff). https://www.rfc-editor.org/rfc/

Yi Yang. Random erasing data augmentation. In AAAI Con-

rfc3302. Accessed: 2023-03-05.

ference on Artificial Intelligence, 2017.

[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya

Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,

A. Performance

Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning

Our goal is to directly model file bytes, with privacy-

transferable visual models from natural language supervi-

preserving applications. Since all previous methods involve

sion. In International conference on machine learning, pages

8748–8763. PMLR, 2021.

some level of domain-specific modeling at inference time

[34] Vijay Raghunathan.

Ece264:

Huffman coding.

(including file decoding and different stems for different in-

https://engineering.purdue.edu/ece264/17au/hw/HW13?alt=

put domains), direct comparison disadvantages our model.

huffman. Accessed: 2023-03-07.

Nevertheless, it’s important to characterize the runtime of

[35] Deevashwer Rathee, Mayank Rathee, Nishant Kumar, Nis-

our model, and compare to previous approaches. This helps

hanth Chandran, Divya Gupta, Aseem Rastogi, and Rahul

to contextualize our model’s performance.

Sharma. Cryptflow2: Practical 2-party secure inference. In

We compare BF-Ti to related works in Table 8. We show

27th Annual Conference on Computer and Communications

performance

on ImageNet [6], including efficiency and ac-

Security (ACM CCS 2020). ACM, August 2020.

curacy

metrics.

We only report train time and throughput

[36] James Robert, Marc Webbie, et al. Pydub, 2018.

for

models

trained

ourselves. This is to avoid hardware

[37] Akash Shah, Nishanth Chandran, Mesfin Dema, Divya

differences

creating

inconsistent

results. For these experi-

Gupta, Arun Gururajan, and Huan Yu. Secure featurization

ments,

tuned

BF-Ti’s

batch

sizes

to maximize GPU uti-

and applications to secure phishing detection. In Proceedings

lization. Note that this improved training time by a rela-

of the 2021 on Cloud Computing Security Workshop, CCSW

’21, page 83–95, New York, NY, USA, 2021. Association for

tively small amount (less than 10%) compared to the exper-

Computing Machinery.

iments in section 5.

[38] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco

Our model’s size and accuracy (8.82 million parameters,

Massa, Alexandre Sablayrolles, and Herv’e J’egou. Training

77.27%) falls between DeiT-Ti (5.72 million parameters,

data-efficient image transformers & distillation through at-

78.62%) and DeiT-S (22.05 million parameters, 73.20%).

tention. In International Conference on Machine Learning,

Our model’s forward pass time is slower due to the large

2020.

number of tokens being modeled. Domain-specific model-

[39] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob

ing (which our model avoids) can drastically reduce com-

Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser,

pute time.

and Illia Polosukhin. Attention is all you need. ArXiv,

Compared to other multi-modal models [15, 14], our

abs/1706.03762, 2017.

model

achieves a comparable accuracy with far fewer flops

[40] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt

and a far smaller model. Note that our model performs no

Haberland, Tyler Reddy, David Cournapeau, EvgeniModel M E[L t ] Top-1 Sec P (M) F (B) Im/s

MobileNetv3 Large

ResNet-50 ✘

✘ N/A

N/A 75.1

78.12 -

- 5.43

25.55 0.22

4.017 9615

3488

DeiT-S p16

DeiT-Ti p=16

DeiT-Ti p=14

DeiT-Ti p=8

RGB No More DeiT-Ti [31] ✘

✘

✘ 196

196

256

784

196 78.62

73.20

74.62

77.44

75.1 336

334

331

824

- 22.05

5.72

5.69

5.72

5.72 4.61

1.26

1.70

7.06

1.26 3594

6885

4970

1243

6885

Perceiver (learned pos) [15]

Perceiver IO (learned pos) [14]

Perceiver (conv) [15]

Perceiver IO (conv) [14] ✔

✔

✔ N/A

N/A

N/A 67.6

72.7

77.4

82.1 -

- 55.9

62.3

42.1

48.6 62.3

407

367

369 -

BF-Ti k=32

BF-Ti k=32 -C

BF-Ti k=32 -C -NPE ✔

✔

✔ 9415

9415

9415 77.27

74.54

68.42 1314

1122

1121 8.82

7.64

5.83 23.74

12.63

12.63 373

370

372

BF-Ti k=4 f0.05

BF-Ti k=4 f=0.1

BF-Ti k=8 f=0.25 ✔

✔

✔ 3762

7524

9407 67.53

71.26

73.65 368

580

769 6.70

7.42

7.93 5.70

11.07

15.40 1687

875

634

Table 8. ImageNet Top-1 accuracy. M: whether the model accepts

various modalities (✘: No. ✔: Yes, but with modality-specific

modeling. ✔: Yes). E[L t ]: length of token inputs to transformer

(after Conv1D for BF-Ti. Note, Perceiver feeds inputs through

cross-attention, so this concept doesn’t directly apply). Sec: Train

epoch time (only reported for models we train to avoid hardware

differences affecting results). P (M): Number of parameters (mil-

lions). F (B): Number of flops (billions). Im/s: Throughput

(images/sec) on an A100 80GB Nvidia GPU. “-” means “not re-

ported”. For DeiT, p is patch size. Choosing p ≤ 4 is unfeasible

(epochs take hours or days). For BF-Ti, k is conv kernel size,

and f indicates fraction of retained pixels for privacy-preserving

camera experiments (subsection 5.4). “-C” indicates replacement

of Conv1D with a windowed mean. “-NPE” indicates an ablation

that removes the positional embedding.

domain-specific modeling at inference time. By contrast,

Perceiver includes domain-specific modeling (file decoding,

tensor reshaping, and optionally convolutions) at inference

time.