JSON-based data sources
  • 03 Dec 2024
  • 9 Minutes to read
  • Contributors
  • Dark
    Light

JSON-based data sources

  • Dark
    Light

Article summary

A JSON document does not contain tabular data, unlike a CSV or XLS file, which is why data sources based on them work a little different.

You have fine-grained control over which parts of the JSON document end up in your data source by using JMESPath expressions. JMESPath is a powerful tool to select and transform JSON data, but the syntax needs some getting used to.

On this page we demonstrate how to set up a simple data source with JMESPath expressions, followed by some in-depth examples.

Try it yourself

You can download the file below to play around with the examples yourself.

Upload it as file asset in the asset manager and then create a data source based on it. You can then edit the structure of the data source and fill in the JMESPath expressions explained below.

products
59.50 KB

Define a source

Firstly, we have to define which part of our JSON document contains the data that we are interested in.

Depending on the document, that may be fairly obvious – for instance if the document is one large JSON array. However, oftentimes a JSON document may be the output of an API endpoint or an export from an external system and hence contains a lot more information besides what we are interested in.

To create a data source, we need an array or object and we can select (or create) one from our document with a JMESPath expression. The system will then create a data source row for each item in the array resp. property value of the object (as defined by the column expressions).

Let's look at an example document:

{
  "products": [
    {
      "id": 21,
      "title": "Cucumber"
    },
    {
      "id": 22,
      "title": "Dog Food"
    },
    {
      "id": 23,
      "title": "Eggs"
    }
  ],
  "categories": ["ingredients", "furniture", "pets"]
}

Here we are only interested in the products property, which holds an array of objects. The JMESPath expression to get it is fairly simple – it’s just the name of the property:

products

This will return the array of objects associated with the products property. For each item in the array, a data source row will be created.

Define columns

Now that we have our source, we must define which parts of it we want to convert into data source columns.

Let’s say each product object from the example above is a bit more complex and looks like this:

{
	"id": 21,
	"title": "Cucumber",
	"description": "Crisp and hydrating cucumbers, ideal for salads, snacks, or as a refreshing side.",
	"category": "groceries",
	"price": 1.49,
	"discountPercentage": 11.44,
	"rating": 4.71,
	"stock": 22,
	"tags": [
		"vegetables",
        "onsale"
	],
	"sku": "6KGF2K6Z",
	"weight": 9,
	"warrantyInformation": "5 year warranty",
	"shippingInformation": "Ships overnight",
	"availabilityStatus": "In Stock",
	"reviews": [
		{
			"rating": 4,
			"comment": "Very satisfied!",
			"date": "2024-05-23T08:56:21.620Z",
			"reviewerName": "Elijah Hill",
			"reviewerEmail": "elijah.hill@x.dummyjson.com"
		},
		{
			"rating": 5,
			"comment": "Fast shipping!",
			"date": "2024-05-23T08:56:21.620Z",
			"reviewerName": "Liam Garcia",
			"reviewerEmail": "liam.garcia@x.dummyjson.com"
		},
		{
			"rating": 4,
			"comment": "Excellent quality!",
			"date": "2024-05-23T08:56:21.620Z",
			"reviewerName": "Ella Cook",
			"reviewerEmail": "ella.cook@x.dummyjson.com"
		}
	],
	"returnPolicy": "30 days return policy",
	"minimumOrderQuantity": 7,
	"images": [
		"https://cdn.dummyjson.com/products/images/groceries/Cucumber/1.png"
	],
	"thumbnail": "https://cdn.dummyjson.com/products/images/groceries/Cucumber/thumbnail.png"
}

As you can see, each product contains nested data and we may not be interested in all of it and we have to bring it into a tabular form.

To do so, we create as many column definitions as we need, each one defined by a unique name and type (as with other data sources). The data for each column can come from anywhere in the object, it doesn’t have to be a top-level property.

Here are some examples for column definitions:

  • Column: Id
    Type: Number
    Path: id
    Result: 21

  • Column: Title
    Type: Text
    Path: title
    Result: Cucumber

  • Column: Reviews
    Type: List<Text>
    Path: reviews[*].comment
    Result: [ "Very satisfied!", "Fast shipping!", "Excellent quality!" ]

Note how in the last column we are able to pick the comment from each object in the reviews array and get a single list of strings as a result, which we can store in a List<Text> column in our data source.

JMESPath allows us to select and transform data in a myriad of ways – here are a few more advanced examples:

  • Get the URL of the first image (or null if the images array is empty or missing):
    Path: images[0]
    Result: https://cdn.dummyjson.com/products/images/groceries/Cucumber/1.png

  • Combine all items in the tags array into 1 string:
    Path: join(',', tags)
    Result: vegetables,onsale

  • Get the average rating across all reviews of the product:
    Path: avg(reviews[*].rating)
    Result: 4.333333333333333

Advanced Scenarios

In this section we tackle a few more in-depth problems that you may run across in your data sources. We recommend that you familiarize yourself with JMESPath a bit before diving into these samples.

JMESPath Community Edition

The JMESPath Community Edition enhances JMESPath with a number of useful features that we will make use of in the examples below. You won’t find information about these features on the official JMESPath website. Both sites are a valuable source of information and examples, but please keep in mind that only the Community Edition contains information about the newer features.

You can try the examples yourself by using the below JSON file for your data source.

products_options
1.25 KB

Here is an excerpt from that file so you can get an idea of what we’re dealing with:

{
  "products": {
    "p1": {
      "title": "Snare Boot",
      "productType": "Boots",
      "options": [
        {
          "name": "Color",
          "optionValues": [
            {
              "name": "Black"
            },
            {
              "name": "Blue"
            }
          ]
        }]}}}  // collapsed for brevity

This example showcases 2 common scenarios you may come across:

  1. inside products the property names are used as unique keys to identify each product (“p1")

    1. compare this to the simple example above, where we have an id property in each product object, which is easier to access as we’ll soon learn

  2. the options array can contain multiple objects which are identified by their name property

    1. this makes the “Color" option more difficult to access, as we’ll see

Property names as keys

Let’s assume that we want to store the product IDs (“p1", “p2" etc.) as an id column in our data source. We can use products as our source expression, which seems good at first but unfortunately, only the property values inside products will be available to our column expressions – the property names will be lost.

  • We can use items(products) instead, which will turn each property name/value pair into an array [ <name>, <value> ] instead

  • With that, we can go to our column definitions and use [0] to get the property name (“p1") and [1] to access the property value (e.g. [1].title would yield “Snare Boot")

  • While this works, it makes all our column definitions unwieldy, as most will have to start with [1]

Fortunately, we can transform the source even more to make our column definitions easier:

map(&merge({id: [0]}, [1]), items(products))

Let’s break down what happens in this expression:

  • as before, we use items(products) to get an array of [ <name>, <value> ] arrays

  • this array is passed to the map function as its 2nd argument. map applies an expression to each item of an array, thus transforming it. These function-like expressions start with &

  • the expression we want to apply to each item is merge({id: [0]}, [1]) which takes the product object ([1]) and merges it with another object that we create in place

  • This in-place created object is used to store our property name in an “id” property:
    {id : [0]}

The result of this expression is an array of objects that now contain the product id as an easy-to-access property:

{
  "id": "p1",
  "title": "Snare Boot",
  "productType": "Boots",
  "options": [
    {
      "name": "Color",
      "optionValues": [
        {
          "name": "Black"
        },
        {
          "name": "Blue"
        }
      ]
    }
  ]
}

Filtering to find objects

Let’s continue the previous example by creating a column definition to hold all colors of a product as a List<Text>.

As we outlined in the introduction, the options array contains a number of objects that are identified by their name property, which makes it more difficult for us to access the “Color" object. Let’s look at the final expression again and then break it down:

options[?name=='Color'].optionValues[*].name[]
  • JMESPath allows us to filter arrays by writing a filter expression: [? name == ‘Color’] will return all objects from the array that have a name property with the value “Color"

  • Next we need to access the optionValues to get to the actual color names; optionValues[*].name will do that for us

  • We now get an array of names inside an array as a result: [["Black", "Blue"]]  (due to how projection and filters work in JMESPath)

  • The flatten operator [] will collapse these nested arrays into a single array ["Black", "Blue"] which we can finally use for our column

Get all Colors across all Products

You may want to create a data source that contains all possible product colors, e.g. to allow users to filter products by their preferred color. Unfortunately, our source JSON doesn’t list all available colors explicitly. Each product indicates in which colors it is available, but it is our job to combine all these options into a single table.

We can use the filter expression from the previous example, but apply it to all products:

products.*[].options[?name=='Color'].optionValues[][].name

As you can see we have to use the flatten operator [] several times here – it’s best if you just copy the expression and take it apart yourself to see the individual results and what each subsequent expression does.

While this solves our problem, we do get duplicate color names. If we only want distinct color names in our data source, we have to augment the above expression even further. We can use the group_by function to group all our color objects by their name (thus eliminating any duplicates) and then use the keys function to return only the property names. Again, we recommend you play around with the expression yourself to see the in-between steps

keys(group_by(products.*[].options[?name=='Color'].optionValues[][], &name))

Create a Permutation / Cross-Join Table

Let’s say we want to create a data source as shown below, where we list each product in each color it is available:

One use case for a table such as this is to filter by product title, to end up with all available colors for that product. Note that you would usually want to use a product ID and not a title for something like this, but since the ID isn’t easily accessible in this scenario (as demonstrated in the 1st example), we will use title for simplicity. You can of course combine the 1st example with this one to use the product ID instead of the title.

Here is the source expression to arrive at this table:

map(&let $product = @ in options[?name=='Color'].optionValues[*].{ title: $product.title, color: name}[], values(products))[]
  • We are using the map function again so that we can process each product individually

  • The key trick in this example is to introduce a variable with the let expression, which allows us to store the current object @ in the variable $product so that we can use it further on.

  • In JMESPath, there is no way to access a ‘parent’ object once we navigated down to e.g. the Color level in an object. We can store such an object in a variable though and thus have access to it at that level.

  • The Color filter has already been showcased in previous examples, the last noteworthy action here is the in-place creation of a new object which holds the product title and color name: { title: $product.title, color: name}


Was this article helpful?