The Pursuit of Tiny Intelligence
@ Aditya Roshadi | Sunday, Aug 10, 2025 | 5 minutes read | Update at Sunday, Aug 10, 2025

Ever since early 2023, when everyone was in awe of ChatGPT's intelligence, I've been hooked on dozens of open-weight and open-source LLMs. To me, having access to a whole AI model, no matter how small, is more interesting than having access to what is essentially a wrapper around a powerful AI (what we usually call an API).

At the time, model quantization was a relatively new concept still being explored; it was hard enough to find a model quantized below INT16. Nowadays even 2-bit quantized models are everywhere, even though the use cases aren't that many (mostly just classification). That's crazy. Nonetheless, I've managed to have fun running several open models, whether text, vision, audio, or a combination of them, on my local machine: a fairly decent gaming laptop with just 6GB of RTX GPU memory.

Just this weekend I had an idea (and actually executed it): running a quantized model on a smartphone, specifically Android. I chose Android because I think it has fewer restrictions on hardware access than iOS. For this trial, I used the ONNX format, in the hope that once it was successful I could port it to iOS more easily than with a platform-locked approach where I'd have to deal with either Core ML or NNAPI.

At first, call it level one, I went with DistilBERT. It's not an LLM, just an NLP model for classification, but I wanted to have fun running it on Android. DistilBERT itself is a distilled version of BERT, so it's fairly small, around 200MB. I still quantized it and got it down to 65MB. The original format was a Hugging Face safetensors checkpoint, which I converted to ONNX and then quantized to INT8. It ran on Android with no drama, but you can only go so far with a classification model these days compared to a generative model. So next I moved on to an LLM: TinyLlama.

Level two: with the same approach as DistilBERT, I gave TinyLlama a shot. Same flow: safetensors -> ONNX -> ONNX INT8. I managed to get it running on Android, but then, you know what? Boom! Surprise, surprise: I got a memory error. FYI, the original TinyLlama, with about 1.1B parameters, is a 4GB model; I shrank it to around 1GB, roughly a 75% reduction, using INT8 quantization. For the record, I had never built an Android app that uses a big pile of memory in one go. There's a difference between a big app and a memory-hungry app: one is about non-volatile storage (disk space), the other about volatile storage (RAM). And it's not even total RAM that matters here. In a JVM-based app like an Android app, there's a thing called the heap size, and after a quick bit of research I found that most devices cap an app's heap at around 512MB. So yeah, not much we can do about it.
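The back-of-envelope arithmetic makes the failure obvious. A rough sketch; the 512MB heap figure is the per-app cap observed on many devices, not a guarantee:

```python
# Rough sketch: why a ~1.1B-parameter model can't fit a typical
# Android app heap, even after INT8 quantization. All numbers are
# approximations.
params = 1.1e9                    # TinyLlama, ~1.1B parameters

fp32_bytes = params * 4           # 4 bytes per weight -> ~4.4 GB
int8_bytes = params * 1           # 1 byte per weight  -> ~1.1 GB

heap_limit = 512 * 1024 * 1024    # common per-app heap cap, ~512 MB

print(f"FP32 weights : {fp32_bytes / 1e9:.1f} GB")
print(f"INT8 weights : {int8_bytes / 1e9:.1f} GB")
print(f"INT8 model is ~{int8_bytes / heap_limit:.1f}x the heap limit")
```

Even the quantized weights alone are roughly twice the heap cap, before counting activations, the KV cache, or the rest of the app.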

Thanks to that finding, I've come to the realization that we may not need AI running on every single device (we're talking about consumer devices here). Think about it: AI is a different beast of software; the way it's created, the way it's deployed, and the way it works are fundamentally different from regular software. I used to get excited thinking that someday we might have our own AI, running on our own device, highly personalized to ourselves. Wouldn't that be fun? It may become reality someday, or it won't, especially with the different hardware-level approaches some big companies are trying, like the Apple Neural Engine (ANE), TPUs, NPUs, APUs, and so on. Either way, a centralized AI inference server also has notable benefits, like centralized intelligence that can learn from many more users. It works like the system updates we have nowadays, where everyone benefits almost the second an update ships. Not to mention that it can be used by any level of hardware, as long as there's connectivity (basically like regular software taking advantage of an API). At scale, AI requires a huge amount of storage and computational power; that alone is a pain when we're dealing with edge or even consumer devices, and it's another thing that sets AI apart from the regular software development approach.

As for me, I'm still very much interested in AI optimization and in running models on my own devices. The difference now is that I'm more aware of hardware limitations when doing AI-related work (especially with LLMs). I'm also getting more interested in creating my own LLM from scratch, or fine-tuning some base-level LLMs; the former might just be for learning purposes, but the latter, I suppose, will have practical benefits. I also think I'm becoming more open and pragmatic about using whatever suits the purpose, including giant, third-party, cloud-hosted AI from big companies when needed. When I think about it, subscribing to those services might actually be more cost-effective than a self-hosted approach, as long as data security isn't that strict a concern.

But anyway, AI is always fun; its black-box nature is sometimes full of surprises, and that's what makes it such an interesting thing to explore!

© 2025 Aditya Roshadi
