Recognize Text in Images with Firebase ML on Android with Text-to-Speech.

Cloud text recognition

Nishān Wickramarathna
Jun 9, 2020


Chances are that at some point in your life you have badly wanted to do this for one of your projects, if you’re like me. I tried many ways to achieve it, but being able to do it on mobile feels like the greatest superpower you can have.

Before starting, do take a look at the official documentation for updated content. One thing to keep in mind is that here I’m talking about cloud text recognition and not on-device text recognition. Also note that this method is meant for text in ordinary images (photos, for example), not for images of documents.

You should have some prior knowledge of Android Studio to follow along. The project is on GitHub.

First, create a new project with an Empty Activity.

Set the minimum SDK level to 21. You can enable legacy library support if you want, but I don’t think it is a must.

Once the project is created, open activity_main.xml and create an ImageView, a TextView and 3 Buttons, one for each action (capture image, convert to text, text-to-speech). I used RelativeLayout but you can use ConstraintLayout. Use the following properties.

Since we’re accessing the camera we need to set permissions for that too.

Open AndroidManifest.xml and put the following line just before the <application> tag.

<uses-feature android:name=""
android:required="true" />

...and the following line inside the <activity> tag.

android:value="ocr" />

Now open up and we’ll start coding.

First create variables for the screen controls and get their references.

Now create click listeners for the 3 buttons. After initializing the textView inside the onCreate function, add the following.

Now, before moving forward, we need to add Firebase to the project.

Go to and click on ‘Add Project’

Provide a project name.

Enable analytics.

Select default account for analytics.

After the project is created, you will be sent to the project page. Here you need to add a new app to the project. Click the Android icon to add an Android app.

Add the package name (you can find it as applicationId in the build.gradle file).

Follow the guide, download google-services.json, and copy it to the specified location.

In the next step, modify the Gradle files as advised.

But instead of implementation ‘’ we will add the following 2 lines.

implementation ''
implementation ''

Afterwards, click Sync Now.

After that is done, we can move forward.

When the user clicks the ‘Capture’ button, we take a picture using the device camera and place it in the imageView (dispatchTakePictureIntent() function). Then the user clicks ‘Detect’, which uses Firebase ML to convert the image to text and stores the output in the textView (detectTextFromImage() function). Finally, the ‘Speak’ button reads the text out loud. You will notice that we have set the text-to-speech language to `English`. Now we will implement those functions and some other helper functions.

After the onCreate function, add the following 4 functions.

The first two functions take a picture using the device camera.

The Android way of delegating actions to other applications is to invoke an Intent that describes what you want done. This process involves three pieces: The Intent itself, a call to start the external Activity, and some code to handle the image data when focus returns to your activity.

If the simple feat of taking a photo is not the culmination of your app’s ambition, then you probably want to get the image back from the camera application and do something with it.

The Android Camera application encodes the photo in the return Intent delivered to onActivityResult() as a small Bitmap in the extras, under the key "data". The following code retrieves this image and displays it in an ImageView.

The last two functions detect text in the image we now have as a Bitmap. Here we create a FirebaseVisionImage object from the Bitmap and use FirebaseVisionTextDetector to get the text in it.

If the text recognition operation succeeds, a FirebaseVisionText object will be passed to the success listener. A FirebaseVisionText object contains the full text recognized in the image and zero or more TextBlock objects.

Each TextBlock represents a rectangular block of text, which contains zero or more Line objects. Each Line object contains zero or more Element objects, which represent words and word-like entities (dates, numbers, and so on).

For each TextBlock, Line, and Element object, you can get the text recognized in the region and the bounding coordinates of the region.
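To make the block, line and element nesting easier to picture, here is a minimal plain-Java sketch of walking such a hierarchy. Note that `Box`, `Block`, `Line`, `Element`, `words()` and `sample()` are hypothetical stand-ins I made up so the example runs without Android or Firebase; they only mirror the shape described above. In a real app you would walk the TextBlock, Line and Element objects of the FirebaseVisionText result instead, which as far as I know expose the same two pieces of information through getText() and getBoundingBox().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/**
 * Plain-Java stand-ins for the block -> line -> element shape described above.
 * These are hypothetical types for illustration, NOT the Firebase classes.
 */
final class TextHierarchySketch {

    /** Simple rectangle standing in for a bounding box (pixel coordinates). */
    static final class Box {
        final int left, top, right, bottom;
        Box(int left, int top, int right, int bottom) {
            this.left = left; this.top = top; this.right = right; this.bottom = bottom;
        }
    }

    /** A word or word-like entity (date, number, and so on). */
    static final class Element {
        final String text; final Box box;
        Element(String text, Box box) { this.text = text; this.box = box; }
    }

    /** One line of text, holding zero or more elements. */
    static final class Line {
        final String text; final Box box; final List<Element> elements;
        Line(String text, Box box, List<Element> elements) {
            this.text = text; this.box = box; this.elements = elements;
        }
    }

    /** A rectangular block of text, holding zero or more lines. */
    static final class Block {
        final String text; final Box box; final List<Line> lines;
        Block(String text, Box box, List<Line> lines) {
            this.text = text; this.box = box; this.lines = lines;
        }
    }

    /** Walks blocks, then lines, then elements, collecting each element's text in order. */
    static List<String> words(List<Block> blocks) {
        List<String> out = new ArrayList<>();
        for (Block block : blocks) {
            for (Line line : block.lines) {
                for (Element element : line.elements) {
                    out.add(element.text);
                }
            }
        }
        return out;
    }

    /** Made-up sample data: one block with two lines, plus one block with no lines. */
    static List<Block> sample() {
        Line first = new Line("Hello world", new Box(0, 0, 110, 20), Arrays.asList(
                new Element("Hello", new Box(0, 0, 50, 20)),
                new Element("world", new Box(60, 0, 110, 20))));
        Line second = new Line("2020", new Box(0, 25, 40, 45), Arrays.asList(
                new Element("2020", new Box(0, 25, 40, 45))));
        Block filled = new Block("Hello world\n2020", new Box(0, 0, 110, 45),
                Arrays.asList(first, second));
        Block empty = new Block("", new Box(0, 50, 0, 50), Collections.<Line>emptyList());
        return Arrays.asList(filled, empty);
    }
}
```

Calling `TextHierarchySketch.words(TextHierarchySketch.sample())` collects the element texts in reading order. The empty block is there only to show that the loops cope with the "zero or more" case without any special handling.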

That’s it. I hope you learned something new! 😎

Edit: I made an automated text recognition app for blind people based on this. Feel free to go through the code and see what I did. I used an old version of the CameraX library and Firebase with some basic UI automation to do this. [APKs]