LanguageLogger: How to Study Smartphone Language Use in the Wild with an Android Keyboard App

Florian Bemmann
Jan 7, 2021

The words people type into their smartphones every day can tell a lot about their personality, mood, relationship to the person they are texting with, and much more. This makes data on real-world language use a valuable source for researchers in many fields, such as psychology, sociology, and human-computer interaction, to name just a few. But it also brings many challenges: the typed content contains a lot of private information, which makes it very difficult to work with as a researcher. Potential participants are (for good reasons) not willing to hand over all their text messages, and even if you got them, a secure and thus cumbersome data analysis workflow would be essential.

But don’t worry: there are smarter ways to solve this problem. What if you could run your data preprocessing on the participants’ smartphones and transmit only the abstracted, anonymous data you really need to your own computer?

In our recent paper “LanguageLogger: A Mobile Keyboard Application for Studying Language Use in Everyday Text Communication in the Wild” we tackle this issue: we developed an Android keyboard app that brings word categorization, word frequency counting, and regular expression matching to your participants’ smartphones. The resulting data thus no longer contains raw content, which makes your study less privacy-invasive and easier to recruit for. And the first step of your data preprocessing is already done, nice! (Nobody likes data preprocessing, really.)

This article shows what exactly LanguageLogger does and what it is capable of, so that you can judge how it could facilitate your next research project.

What does LanguageLogger do?

LanguageLogger is an Android keyboard app that abstracts the typed words directly on the participant’s device. Only abstracted, and thus anonymous, data is logged. It incorporates three core concepts of language analysis:

Whitelist Counting
You as the researcher define a list of words. LanguageLogger counts how often the user types these words during the study.

Word Categorization
You define a set of categories and the words that belong to them. Whenever the user types a word that belongs to a category, an event for that category is logged. Events contain timestamps, are linked to the “typing session” (e.g. text message) they were entered in, and contain metadata like the target app (e.g. WhatsApp), character count, …

Custom Regex Filtering
This can be used to log words that match a specific pattern, defined as a regular expression.
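To make the three concepts more tangible, here is a minimal Python sketch of how such an abstraction step could look. It is not LanguageLogger’s actual implementation (the app runs on Android); the word lists, the hashtag pattern, and the function name abstract_message are assumptions chosen purely for illustration.

```python
import re
from collections import Counter

# Hypothetical study configuration; in a real study the researcher supplies these.
WHITELIST = {"exampleshop"}                      # words to count
CATEGORIES = {"dog": "animal", "cat": "animal",  # word -> category
              "buy": "shopping", "order": "shopping"}
PATTERN = re.compile(r"#\w+")                    # example regex: hashtags


def abstract_message(text, counts):
    """Return abstracted events for one typed message; update counts in place."""
    events = []
    for token in text.split():
        stripped = token.strip(".,!?;:")
        word = stripped.lower()
        if word in WHITELIST:              # whitelist counting
            counts[word] += 1
        if word in CATEGORIES:             # word categorization
            events.append(("CATEGORY", CATEGORIES[word]))
        if PATTERN.fullmatch(stripped):    # custom regex filtering
            events.append(("REGEX", stripped))
    return events


counts = Counter()
events = abstract_message(
    "Thomas wants to buy a dog at ExampleShop tomorrow #excited", counts
)
print(events)
print(dict(counts))
```

Note that everything not covered by the configuration, such as the name Thomas, never appears in the output.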

Example: Where do people buy animal-related stuff?

The LanguageLogger abstraction concept: Only the desired categories and word counts are logged. Privacy-invasive data like names and other details are discarded.

Imagine you are interested in animal-related shopping behaviour. You could create categories for animal words and shopping words. To get details on which shops people talk about, you additionally compile a list of possible shops. And let’s say you love emojis and want to log them too: for that, you configure a regular expression that matches all emojis.

What is LanguageLogger actually logging now? Only the content relevant to your research. Other, privacy-invasive content, e.g. the information that the message recipient might go shopping tomorrow and that Thomas will join, is not stored.
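As an illustration of the emoji part, the following Python sketch matches emojis with a character class over a few common Unicode ranges. This is a rough approximation and an assumption on our side: it does not cover the full Unicode emoji set, and the regular expression syntax accepted by the app may differ from Python’s.

```python
import re

# Rough approximation: a few common emoji code point ranges,
# NOT the complete Unicode emoji definition.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)


def find_emojis(text):
    """Return all characters in text that fall into the ranges above."""
    return EMOJI_PATTERN.findall(text)


print(find_emojis("Going to buy dog food \U0001F436 tomorrow with Thomas"))
```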

The resulting data

You can find the log data resulting from the above example as a database dump here.

SQL log data (category events) of the above example. Can you spot the word categories we were looking for and the emoji?
The entry for that one typing session.
And the updated word count table, containing just one count so far.

For details on the resulting data, please consult the README, which also lists further data such as touch events and sensors.

Getting Started: How to Set Up LanguageLogger

To make the Android app work, you have to set up the accompanying backend on some online-accessible machine. The code for both the Android app and the backend can be found on GitHub.

Step 1: Set Up the Backend

Follow steps 1 and 2 in the backend project’s repository to set up the backend with a database. For testing purposes you can simply do that on your local computer; for a real study you need a server or virtual machine that is accessible from the web via HTTPS. For real studies you should also consider an encrypted database, such as the MySQL derivative MariaDB with its File Key Management Encryption plugin.

To let your smartphone access your local backend, we recommend using a tunnel service like ngrok (README Step 3). It creates a publicly accessible URL, which you will need to run the Android app.

For a real study: The tunnel setup is only intended for local testing. For production we instead recommend installing nginx on your server and creating an SSL certificate via Let’s Encrypt.

Step 2: Configure your Study

If you just want to get started quickly to give it a try, import the demo configuration as described in the README’s step 4.

For a real study: Carefully think about what you want to log. Do you need events or are word counts enough? If you need categories, which words should be mapped to which categories? If you want word counts, which words are interesting for you?

Example configuration files for word categorization (left) and word frequency counting (right).

You could also use a predefined list, for example the DeReWo dictionary of German words to count all dictionary words, or the SentiWS dataset, which assigns common words a sentiment score (you could use the score values as categories).
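If you prefer a handful of coarse categories over raw score values, you could bin the scores first. The Python sketch below assumes you have already extracted (word, score) pairs from the sentiment dataset; the threshold, the category names, and the example scores are illustrative assumptions, not values taken from SentiWS.

```python
def sentiment_category(score, threshold=0.1):
    """Map a sentiment score to a coarse category name (hypothetical bins)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


def to_category_lines(scored_words, threshold=0.1):
    """Turn (word, score) pairs into 'word<TAB>category' lines."""
    return [
        f"{word}\t{sentiment_category(score, threshold)}"
        for word, score in scored_words
    ]


# Made-up example scores, for illustration only.
example = [("gut", 0.4), ("schlecht", -0.8), ("haus", 0.0)]
print("\n".join(to_category_lines(example)))
```

Each resulting line holds a word and its category separated by a tab, which is the layout the word categorization files use.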

The configuration happens through the UI of the backend. You have to create word / category lists in text files with the file ending *.rime, as shown in the figure, and import them in the backend’s UI. For word categorization, the configuration file should contain one word with its category per line (tab-separated!); for word frequency counting, simply list one word per line.
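Since a stray space instead of a tab is easy to miss, it can be worth checking your files before importing them. The following Python sketch parses both file layouts as described above; the function names are our own, and the backend may apply additional rules that this check does not know about.

```python
def parse_category_config(text):
    """Parse 'word<TAB>category' lines into a dict; raise on malformed lines."""
    mapping = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            raise ValueError(
                f"line {line_number}: expected 'word<TAB>category', got {line!r}"
            )
        mapping[parts[0].strip()] = parts[1].strip()
    return mapping


def parse_word_list(text):
    """Parse a word frequency list: one word per line, empty lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


print(parse_category_config("dog\tanimal\nbuy\tshopping\n"))
print(parse_word_list("exampleshop\nanothershop\n"))
```

To check a real file, read it first, e.g. with pathlib.Path("categories.rime").read_text(encoding="utf-8"), where the file name is a placeholder.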

Via the UI you can, among other things, also configure the study duration and keyboard layouts. Please read the section Research Configuration in the backend’s README for details and examples.

Step 3: The Android App

With the backend running, it is time for the Android app. Check out the app project repository and open it in Android Studio. There is just one important adaptation you have to make: set your backend’s URL. Go to the file researchime-app/ResearchIME-Module/src/main/res/values/strings.xml

and paste the HTTPS URL of your backend there (e.g. https://dd02e086701f.ngrok.io if you use ngrok for local testing). After that, you can run the app via Android Studio.
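Before building the app, it can help to sanity-check the URL you are about to paste. The small Python helper below is a hypothetical convenience and not part of the LanguageLogger tooling; it only verifies that the URL uses HTTPS and names a host, not that the backend is actually reachable.

```python
from urllib.parse import urlparse


def is_valid_backend_url(url):
    """Return True if url uses the https scheme and contains a host name."""
    parsed = urlparse(url.strip())
    return parsed.scheme == "https" and bool(parsed.hostname)


print(is_valid_backend_url("https://dd02e086701f.ngrok.io"))
print(is_valid_backend_url("http://dd02e086701f.ngrok.io"))
```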

For a real study: In addition to setting your URL, you should place your server’s certificate in research_cert.crt and uncomment the certificate check in RestClient. Details on these steps are again listed in the README.

Step 4: Distribute your App to Conduct the Study!

You are ready to go (after some testing, of course…). As we doubt that the app would be accepted on the Play Store, we recommend distributing it as an APK download. Providing a short how-to-setup video or instruction page might be helpful for some participants, because they have to change a special Android security setting to be able to install the downloaded APK.

What else can I do with LanguageLogger?

The above example is just a basic one, demonstrating LanguageLogger’s core concept. Digging deeper, LanguageLogger has some more features. To get to know them in detail, we recommend checking out our two latest papers related to LanguageLogger:

Bemmann, Buschek 2020: LanguageLogger: A Mobile Keyboard Application for Studying Language Use in Everyday Text Communication in the Wild

Buschek, Bisinger, Alt 2018: ResearchIME: A mobile keyboard application for studying free typing behaviour in the wild

Some more capabilities of LanguageLogger:

ADDED, CHANGED, REMOVED. Unlike in the above example, LanguageLogger does not only log added words. If a user changes an existing word or deletes it entirely, events of type CHANGED and REMOVED are created.

Message Difference Analysis vs. Full Message Analysis. The events ADDED, CHANGED and REMOVED model the real-world user actions. However, if you are only interested in the final message sent to a recipient, you might not care how often the user changed a word. To cover both cases, LanguageLogger analyzes each message twice:

  • Message Difference: The sequence of user actions is modeled with ADDED, CHANGED and REMOVED events.
  • Full Message: The final message (at the time when the user closed the keyboard) is taken, split into words, and for each word a CONTAINS event is logged.
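The difference between the two analyses can be sketched in a few lines of Python. This is only an illustration under our own assumptions: it compares two snapshots of a message with difflib from the standard library, which is not necessarily how the app derives its events, and it keeps the raw words visible for readability, whereas LanguageLogger would abstract each word before logging.

```python
from difflib import SequenceMatcher


def difference_events(old_words, new_words):
    """Derive ADDED / CHANGED / REMOVED events between two word sequences."""
    events = []
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old, new = old_words[i1:i2], new_words[j1:j2]
        paired = min(len(old), len(new))
        # Words replaced one-to-one count as changed; leftovers as removed/added.
        events += [("CHANGED", o, n) for o, n in zip(old, new)]
        events += [("REMOVED", o, None) for o in old[paired:]]
        events += [("ADDED", None, n) for n in new[paired:]]
    return events


def full_message_events(final_text):
    """Log one CONTAINS event per word of the final message."""
    return [("CONTAINS", word) for word in final_text.split()]


print(difference_events("I buy food".split(), "I bought dog food".split()))
print(full_message_events("I bought dog food"))
```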

Lemmatization. In real-world language you have to deal with many inflected forms. For example, the word buy will also occur as bought. To avoid having to add all possible forms to your word / category lists, LanguageLogger uses the tool TreeTagger, created by Helmut Schmid, which reduces words to their base form (e.g. if a user types bought, LanguageLogger will treat it as buy). This functionality is optional (it can be configured) and is currently only implemented for German. However, it would be fairly easy to add parameter files for other languages (provided on the TreeTagger website). We plan to implement this in the future.
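The effect of lemmatization on category matching can be illustrated with a toy Python example. The hand-written lemma table below merely stands in for a real lemmatizer such as TreeTagger, and both dictionaries are made-up assumptions.

```python
# Tiny hand-written lemma table standing in for a real lemmatizer (assumption).
LEMMAS = {"bought": "buy", "buys": "buy", "buying": "buy", "dogs": "dog"}

# Hypothetical category list containing base forms only.
CATEGORIES = {"buy": "shopping", "dog": "animal"}


def lemmatize(word):
    """Reduce a word to its base form; unknown words are returned lowercased."""
    word = word.lower()
    return LEMMAS.get(word, word)


def categorize(word):
    """Look up the category of a word after lemmatization, or None."""
    return CATEGORIES.get(lemmatize(word))


print(categorize("bought"))
print(categorize("tomorrow"))
```

With this step in place, the category list only needs the base form buy to also catch bought, buys and buying.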

Touch Biometrics. As demonstrated in the 2018 paper ResearchIME: A mobile keyboard application for studying free typing behaviour in the wild, LanguageLogger can also log touch events. Leveraging those, one could also run analyses of typing speed, touch accuracy, and usage of auto-correction.
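As a simple example of such an analysis, the Python sketch below derives inter-key intervals and an average typing speed from a list of key press timestamps. The input format (timestamps in milliseconds, one per key press) is an assumption for illustration; please check the README for how touch events are actually stored.

```python
from statistics import mean


def inter_key_intervals(timestamps_ms):
    """Return the time gaps (in ms) between consecutive key presses."""
    return [later - earlier
            for earlier, later in zip(timestamps_ms, timestamps_ms[1:])]


def mean_keys_per_second(timestamps_ms):
    """Average typing speed in keys per second; 0.0 if it cannot be computed."""
    intervals = inter_key_intervals(timestamps_ms)
    if not intervals or mean(intervals) == 0:
        return 0.0
    return 1000.0 / mean(intervals)


print(inter_key_intervals([0, 200, 400, 700]))
print(mean_keys_per_second([0, 250, 500]))
```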

Integration Into Other Data Sources. Our main contribution is bringing these language abstraction concepts to the client device. The keyboard is only one possible data source for this. Thanks to LanguageLogger’s modular architecture, one can also integrate it into other apps. A third GitHub repository demonstrates how to integrate LanguageLogger’s language abstraction logic into a simple Mobile Sensing app. If you are interested in feeding LanguageLogger with notification contents, incoming SMS messages, or anything else you collect on your participants’ smartphones, please follow the step-by-step guide in the repo’s README.

Do you have any questions about LanguageLogger? Will it facilitate your next research project, or is there something important missing? Feel free to let us know your questions, ideas and opinions via E-Mail!

Florian Bemmann

PhD student in HCI at LMU Munich, working on Mobile Sensing research tools to facilitate interdisciplinary research in the wild.