1. Customer Journey

A Customer Journey is a record of all the events that describe how a customer interacts with your business, your service, or your product. It captures every possible interaction, from events on your website to phone calls from your staff.

Customer Journeys are very powerful, but they are hard to analyze by conventional means.

2. Data Integration and Cleaning

Data Integration and Cleaning is something every Data Scientist has to go through, and PythonQL can speed this process up tremendously. We show how to quickly assess the quality of a new dataset that you're trying to integrate and how much value it will bring to the table.

3. Model Evaluation

When you train multiple models, comparing them and deciding which one to use in which case can be tricky.

PythonQL helps solve this problem in a declarative way.

4. Nested and Semi-structured Data

Working with nested data or data with flexible schema can be very empowering, especially when your data is already coming in JSON or XML formats. But querying and constructing nested objects can be quite challenging.

Customer Journey

Collecting the full information about your customer’s journey through your company's services and then analyzing your business based on these journeys is a very powerful method of improving customer experience and optimizing the overall business of the company. However, a customer journey is a list of heterogeneous events with timestamps, and running analytics on such structures is pretty challenging. Simply modelling these journeys as a table in a relational database and using SQL to process them doesn’t quite work. Neither do typical data analysis frameworks.

Why is that? It turns out that you need to aggregate data about various properties of multiple journeys, but the properties themselves also need to be computed from the corresponding sequence of events. This requires nested data models and nesting in the query language, together with all the analytical capabilities of SQL. PythonQL is quite useful here, since it can seamlessly combine queries on nested data with the advanced analytics that is especially useful in data science.
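The two-level pattern described above can be sketched in plain Python (the actual PythonQL query syntax is not shown here; the event tuples and the "loan request" property are hypothetical stand-ins):

```python
from collections import defaultdict

# Hypothetical flat event stream: (customer_id, timestamp, event_type)
events = [
    ("c1", 1, "open_account"), ("c1", 2, "loan_request"),
    ("c1", 3, "loan_issued"), ("c2", 1, "open_account"),
    ("c2", 5, "close_account"),
]

# Step 1: nest -- group the flat stream into one journey per customer,
# ordered by timestamp.
journeys = defaultdict(list)
for cust, ts, etype in sorted(events, key=lambda e: (e[0], e[1])):
    journeys[cust].append(etype)

# Step 2: compute a per-journey property from each event sequence.
requested_loan = {c: "loan_request" in js for c, js in journeys.items()}

# Step 3: aggregate across journeys, the way SQL aggregates across rows.
share = sum(requested_loan.values()) / len(requested_loan)
print(share)  # 0.5
```

Note that step 2 is where plain SQL struggles: the property is computed over a nested sequence, not over a flat row.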

Data set description.

For this use case we’ll use a toy dataset from an imaginary bank. The customers of the bank hold accounts with cards, can deposit and withdraw money, and can get loans from the bank. The bank makes most of its money by giving out loans, but as you’ll see shortly, default rates are not the only question we might want to ask of the customer journeys. So our simple customer journey has the following events in it:

1. The customer opens an account in the bank

2. The customer deposits and withdraws money from the card

3. The customer requests a loan

4. The bank approves and issues a loan

5. The customer pays the loan

6. The bank calls the customer to remind her that the loan is due

7. The customer closes the bank account

Each event type has some data associated with it, and we could model it in a number of ways. Here we model each event as a Python named tuple, and each event type has its own schema (some fields are shared with other event types).

You can view the sample data by running Q0 of this scenario.
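A minimal sketch of that named-tuple model, covering the seven event types listed above (the field names and the sample journey are illustrative assumptions, not the dataset's actual schema):

```python
from collections import namedtuple

# One schema per event type; `customer_id` and `ts` are shared,
# `amount` intersects several types, `rate` is loan-specific.
OpenAccount  = namedtuple("OpenAccount",  ["customer_id", "ts"])
Deposit      = namedtuple("Deposit",      ["customer_id", "ts", "amount"])
Withdrawal   = namedtuple("Withdrawal",   ["customer_id", "ts", "amount"])
LoanRequest  = namedtuple("LoanRequest",  ["customer_id", "ts", "amount"])
LoanIssued   = namedtuple("LoanIssued",   ["customer_id", "ts", "amount", "rate"])
LoanPayment  = namedtuple("LoanPayment",  ["customer_id", "ts", "amount"])
ReminderCall = namedtuple("ReminderCall", ["customer_id", "ts"])
CloseAccount = namedtuple("CloseAccount", ["customer_id", "ts"])

# One customer's journey, ordered by timestamp.
journey = [
    OpenAccount("c1", 1),
    Deposit("c1", 2, 500.0),
    LoanRequest("c1", 3, 10_000.0),
    LoanIssued("c1", 4, 10_000.0, 0.12),
    LoanPayment("c1", 5, 1_000.0),
    CloseAccount("c1", 6),
]

# Heterogeneous events can still be queried uniformly by type.
deposited = sum(e.amount for e in journey if isinstance(e, Deposit))
print(deposited)  # 500.0
```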

Data Integration and Cleaning

Even though data is supposed to be clean and of top quality in the 21st century, that’s not always the case. Unfortunately, cleaning and integrating dirty data is a very common problem that falls onto data scientists. This is especially painful when you just need a quick assessment of different data sources, or of whether integrating a new data source would yield results that are worth the trouble. So in this use case we focus on quick methods to assess a data source with PythonQL.

In this scenario we have two data sources: the customer database, where we have collected data about our customers; and an order database, a new database we're considering integrating with our main data. It contains orders made by different folks in retail stores around the country. However, the quality of the data and its completeness is a bit questionable. Our task in this scenario is to quickly find out whether this data is useful to us.
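A quick-assessment pass like the one described above can be sketched in plain Python: profile per-field completeness, then check how many order records can actually be joined to a known customer. The record layout and the id values here are hypothetical:

```python
# Hypothetical order records from the new source; None marks missing data.
orders = [
    {"customer_id": "c1", "store": "NYC-1", "total": 120.0},
    {"customer_id": None, "store": "SF-2",  "total": 80.0},
    {"customer_id": "c7", "store": "NYC-1", "total": None},
]
# Customer ids known to our main customer database.
customers = {"c1", "c2", "c3"}

# Per-field completeness: share of orders where the field is present.
fields = ["customer_id", "store", "total"]
completeness = {
    f: sum(o[f] is not None for o in orders) / len(orders) for f in fields
}

# Join coverage: share of orders that map to a customer we already know.
joinable = sum(o["customer_id"] in customers for o in orders) / len(orders)

print(completeness)
print(joinable)
```

Low completeness or low join coverage is a cheap early signal that the new source may not be worth the integration effort.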

Model Evaluation

As every data scientist knows, getting the first results from your models is exciting, but lots and lots of time will be spent analysing the models’ performance and improving their results. Surely, basic measures like AUC or RMSE are easy to calculate, but what happens when you need to qualitatively compare two models on the data and figure out which cases are better covered by each model? In this case the ad-hoc analytical abilities of PythonQL come to the rescue! So we won’t be querying databases in this scenario; we will be querying the results of our machine learning models. Let’s get started!

Data set description.

For this example we will use a housing data set and train different models on it. We will be predicting the value of a property from (currently) fake data, based on a number of features: zip code, city, average earnings of the residents, distance to the beach, distance to the nearest park, distance to the nearest school, and crime rate.

Let's define an error in our prediction as a value that falls more than one sigma from the mean. We will first examine the results of a single model, and then we'll compare two models against each other.
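The one-sigma error rule and the pairwise comparison can be sketched in plain Python (the prediction values below are made up for illustration):

```python
import statistics

# Hypothetical predictions from two models on the same test set.
actual  = [100, 200, 300, 400, 500]
model_a = [110, 190, 310, 390, 700]
model_b = [150, 210, 290, 410, 510]

def errors(pred):
    """Flag predictions whose residual falls more than one sigma
    from the mean residual."""
    resid = [p - a for p, a in zip(pred, actual)]
    mu = statistics.mean(resid)
    sigma = statistics.pstdev(resid)
    return [abs(r - mu) > sigma for r in resid]

err_a, err_b = errors(model_a), errors(model_b)

# The qualitative comparison: cases where model A errs but model B does not.
a_only = [i for i, (ea, eb) in enumerate(zip(err_a, err_b)) if ea and not eb]
print(a_only)  # [4]
```

Inspecting the features of the cases in `a_only` (and its mirror image for model B) tells you which regions of the data each model covers better.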

Nested and Semi-structured Data

In this scenario we will be querying JSON data that comes from a NoSQL database, and we'll be joining it to some more structured, relational data. And finally, we'll be constructing somewhat nested JSON objects that use hierarchy to organize data, which in certain cases is more convenient than producing a bunch of relational tables.


We have a JSON product dataset that supposedly came from some NoSQL system (in the future PythonQL will be able to send queries to NoSQL databases). It's a somewhat dirty product catalog (cars) with reviews.

Car make, model, and year fields are common across all objects.

The manufacturer can be a plain text field, or a JSON object: {"manufacturer": {"company":"Ford", "address":"..."}}

Reviews can be just a list of text reviews: ["Great car","Terrible seats",...], or a list of JSON objects: [{"review_text":"Great car", "stars":5, "author":"Jack"}, ...]

The price also can be a text field or a JSON object: {"price":{"suggested_retail_price":15350, "average_price":14600}}

We also have a small relational dataset that is better structured and has dealer information in it:

[dealer, car_make, car_model, car_year, purchase_price, ad_price]
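Before such data can be joined to the dealer table, the variant fields have to be normalized into one shape. A plain-Python sketch of that step, using the field shapes from the examples above (the top-level key names `make`, `model`, `year`, and the choice of `average_price` as the canonical price, are assumptions):

```python
# Two product records showing both shapes of each variant field.
products = [
    {"make": "Ford", "model": "Focus", "year": 2015,
     "manufacturer": "Ford",
     "reviews": ["Great car", "Terrible seats"],
     "price": "15350"},
    {"make": "Ford", "model": "Fiesta", "year": 2014,
     "manufacturer": {"company": "Ford", "address": "Dearborn, MI"},
     "reviews": [{"review_text": "Great car", "stars": 5, "author": "Jack"}],
     "price": {"suggested_retail_price": 15350, "average_price": 14600}},
]

def normalize(p):
    """Flatten one heterogeneous record into a fixed schema, so it can be
    joined on (make, model, year) with the relational dealer table."""
    mfr, price = p["manufacturer"], p["price"]
    return {
        "make": p["make"], "model": p["model"], "year": p["year"],
        "company": mfr["company"] if isinstance(mfr, dict) else mfr,
        "price": (price["average_price"] if isinstance(price, dict)
                  else float(price)),
        "review_texts": [r["review_text"] if isinstance(r, dict) else r
                         for r in p["reviews"]],
    }

flat = [normalize(p) for p in products]
print(flat[1]["company"], flat[1]["price"])  # Ford 14600
```

The reverse direction, nesting the joined rows back into hierarchical JSON (e.g. one object per make with its models inside), is the other half of this scenario.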