Writing More Performant Queries with LINQ on EF Core 6.0 ⚙️🔧

This one sat in the drafts folder for a long time; I wanted to back it with my personal experience before handing out any suggestions to the reader, because untested advice eventually turns out to be bad advice. With the release of .NET 6.0, I thought it was time. Every practice I present today has been tested and comes with proof: you can also download the source code and see for yourself (running the migrations will create the database and seed the data).

📌 Prelude

First of all, what is LINQ? Language-Integrated Query (LINQ) is the name for a set of technologies based on the integration of query capabilities directly into the C# language. Traditionally, queries against data are expressed as simple strings without type checking at compile time or IntelliSense support. Furthermore, you have to learn a different query language for each type of data source: SQL databases, XML documents, various web services, and so on. With LINQ, a query is a first-class language construct, just like classes, methods, and events. You write queries against strongly typed collections of objects by using language keywords and familiar operators.

The following example shows the complete query operation. The complete operation includes creating a data source, defining the query expression, and executing the query in a foreach statement.

Complete operation includes creating a data source, defining the query expression, and executing the query.
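
A minimal sketch of such a complete operation, in the style of the classic numbers example from the C# documentation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// 1. Create a data source.
int[] numbers = { 0, 1, 2, 3, 4, 5, 6 };

// 2. Define the query expression. Nothing runs yet: execution is deferred.
IEnumerable<int> evensQuery =
    from num in numbers
    where num % 2 == 0
    select num;

// 3. Execute the query: the foreach statement drives the enumeration.
foreach (int num in evensQuery)
{
    Console.Write("{0} ", num); // prints: 0 2 4 6
}
```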

Let’s start with small things.

🔀 What comes/should come first? WHERE or SELECT

This is something almost all beginners have questions about. Thinking logically, one might say the Where-first approach is more performant, since it filters your collection first and then executes Select only for the filtered values. Well, this is true up to an extent. Consider the following two queries.
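
The two queries under comparison might look like this (the context and entity names here are assumptions, reconstructed from the generated SQL below):

```csharp
// "SelectFirst": project the name, then filter the projection.
var selectFirst = context.Employees
    .Select(e => e.Name)
    .Where(name => name.Contains("some text"))
    .ToList();

// "WhereFirst": filter the entities, then project the name.
var whereFirst = context.Employees
    .Where(e => e.Name.Contains("some text"))
    .Select(e => e.Name)
    .ToList();
```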

These two queries translate to the same SQL, as you can verify with SQL Server Profiler.

SELECT [e].[Name]
FROM [Employees] AS [e]
WHERE [e].[Name] LIKE N'some text'

So you might think both should take around the same time to execute, right? Well yes, broadly speaking, but sometimes they differ. More precisely:

  • If most entities will pass the Where test, apply Select first; it performs better in this case.
  • If fewer entities will pass the Where test, apply Where first.

The following benchmark was designed to test just that. There are more than 3,000 records where the name contains “1”, and around 20 records where the name contains “500”; the entire Employee table has 10,000 records.
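
A sketch of such a benchmark using BenchmarkDotNet (the class, context, and parameter names are hypothetical):

```csharp
[MemoryDiagnoser] // reports the Allocated column referenced in the results
public class WhereVsSelectBenchmarks
{
    private readonly AppDbContext _context = new AppDbContext();

    // "1" matches > 3,000 employees, "500" matches around 20.
    [Params("1", "500")]
    public string NameParam { get; set; }

    [Benchmark]
    public List<string> SelectFirst() => _context.Employees
        .Select(e => e.Name)
        .Where(n => n.Contains(NameParam))
        .ToList();

    [Benchmark]
    public List<string> WhereFirst() => _context.Employees
        .Where(e => e.Name.Contains(NameParam))
        .Select(e => e.Name)
        .ToList();
}
```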

And the results were,

If most entities(1) will pass the Where test, apply Select first. If fewer entities(500) will pass the Where test, apply Where first.

So in the case of most entities passing the Where test, “SelectFirst” performed better (21.06 ms vs. 26.03 ms), and in the case of fewer entities passing the Where test, “WhereFirst” performed better (19.60 ms vs. 20.89 ms).

In addition to the above two queries, I added two more with JOIN operations (through the navigation property Company). Note that Company and Employee have a one-to-many relationship (one company has many employees).
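
The article’s exact queries are not shown, but based on the generated SQL they plausibly looked something like this (property and parameter names assumed):

```csharp
// "WhereFirst" with a join: filter the employees, then project company
// details through the Company navigation property.
var whereFirstJoin = _context.Employees
    .Where(e => e.Name.Contains(NameParam))
    .Select(e => new { e.Company.Id, e.Company.Name })
    .ToList();

// "SelectFirst" with a join: project through the navigation first, filter after.
var selectFirstJoin = _context.Employees
    .Select(e => new { CompanyId = e.Company.Id, CompanyName = e.Company.Name, e.Name })
    .Where(x => x.Name.Contains(NameParam))
    .Select(x => new { Id = x.CompanyId, Name = x.CompanyName })
    .ToList();
```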

In this case, it doesn’t matter whether the test passes for few or many entities: applying Where first resulted in better performance.

SELECT [c].[Id], [c].[Name]
FROM [Employees] AS [e]
INNER JOIN [Companies] AS [c] ON [e].[CompanyId] = [c].[Id]
WHERE EXISTS (
SELECT 1
FROM [Employees] AS [e0]
WHERE ([c].[Id] = [e0].[CompanyId]) AND ((@__NameParams_0 LIKE N'') OR (CHARINDEX(@__NameParams_0, [e0].[Name]) > 0)))
-- parameter: @__NameParams_0 nvarchar(4000) = N'500'

The “WhereFirst” approach took less time in both cases.

So when in doubt, I recommend you use Where first and then Select. Also notice the space complexity (the Allocated column): “WhereFirst” uses less memory (909 KB vs. 2,693 KB), which is the most decisive factor in favor of “WhereFirst”.

🔢 What comes/should come first? WHERE or ORDERBY

This one can be figured out if you think the problem through a little. Consider these two queries.
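
A sketch of the two orderings (entity names borrowed from the sample database used later in the article):

```csharp
// "OrderByFirst": the entire collection is sorted, then filtered.
var orderByFirst = context.Books
    .OrderBy(b => b.Title)
    .Where(b => b.Author.FirstName.Contains("some text"))
    .ToList();

// "WhereFirst": only the filtered results are sorted.
var whereFirst = context.Books
    .Where(b => b.Author.FirstName.Contains("some text"))
    .OrderBy(b => b.Title)
    .ToList();
```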

If you use the first approach, the entire collection has to be sorted and then filtered. If we had a million items, only one of which had an author first name like “some text”, we’d be wasting a lot of time ordering results that would be thrown away. If you use the second (Where-first) approach, we’re only ordering the filtered results, which in the sample case of “just a single item matching the filter” is a lot more efficient, both in time and space.

And the results confirm this too. (1.974 ms < 2.024 ms)

⬆️ Enhancements with EF Core 6.0

EF Core 6.0 contains better support for GroupBy queries. Specifically, EF Core now:

  • Translates GroupBy followed by FirstOrDefault (or similar) over a group
  • Supports selecting the top N results from a group
  • Expands navigations after the GroupBy operator has been applied

As an example, you can now write queries like this:
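
A sketch of such a query, consistent with the SQL translation shown below (the People schema is assumed from that translation):

```csharp
// Group people by first name, take only the first group, and pick the
// first full name within each group.
var query = context.People
    .GroupBy(p => p.FirstName)
    .Select(g => new
    {
        FirstName = g.Key,
        FullName = g
            .OrderBy(p => p.FirstName)
            .Select(p => p.FirstName + " " + p.MiddleInitial + " " + p.LastName)
            .FirstOrDefault()
    })
    .Take(1)
    .ToList();
```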

This was not possible in EF Core 5.0. You can also now use Enumerable.Take and Enumerable.First to select only the first N records, which is mainly helpful for introducing paging to your queries to increase execution speed.

This will translate to: (notice the TOP keyword)

SELECT [t0].[FirstName], [t0].[FullName], [t0].[c]
FROM (
SELECT TOP(1) [p].[FirstName]
FROM [People] AS [p]
GROUP BY [p].[FirstName]
) AS [t]
LEFT JOIN (
SELECT [t1].[FirstName], [t1].[FullName], [t1].[c]
FROM (
SELECT [p0].[FirstName], (((COALESCE([p0].[FirstName], N'') + N' ') + COALESCE([p0].[MiddleInitial], N'')) + N' ') + COALESCE([p0].[LastName], N'') AS [FullName], 1 AS [c], ROW_NUMBER() OVER(PARTITION BY [p0].[FirstName] ORDER BY [p0].[FirstName]) AS [row]
FROM [People] AS [p0]
) AS [t1]
WHERE [t1].[row] <= 1
) AS [t0] ON [t].[FirstName] = [t0].[FirstName]

Because of these improvements, there will be fewer performance bottlenecks in the newer queries. If you have any workarounds applied for these scenarios, it is always good to revisit your code base and apply the new approach.

Another improvement is the removal of the last ORDER BY clause when joining for collections. When loading related one-to-many entities, EF Core adds ORDER BY clauses to make sure all related entities for a given entity are grouped together. However, the last ORDER BY clause is not necessary for EF to generate the needed groupings and can have an impact on performance. Therefore, in EF Core 6.0 this clause is removed.

For example, consider this query:
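
A sketch consistent with the two translations shown below (entity names assumed from the SQL):

```csharp
// Each customer together with a filtered collection of its orders.
var query = context.Customers
    .Select(c => new
    {
        c.Id,
        FirstOrder = c.Orders.Where(o => o.Id == 1).ToList()
    })
    .ToList();
```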

With EF Core 5.0 on SQL Server, this query is translated to:

SELECT [c].[Id], [t].[Id], [t].[CustomerId], [t].[OrderDate]
FROM [Customers] AS [c]
LEFT JOIN (
SELECT [o].[Id], [o].[CustomerId], [o].[OrderDate]
FROM [Order] AS [o]
WHERE [o].[Id] = 1
) AS [t] ON [c].[Id] = [t].[CustomerId]
ORDER BY [c].[Id], [t].[Id]

With EF Core 6.0, it is instead translated to:

SELECT [c].[Id], [t].[Id], [t].[CustomerId], [t].[OrderDate]
FROM [Customers] AS [c]
LEFT JOIN (
SELECT [o].[Id], [o].[CustomerId], [o].[OrderDate]
FROM [Order] AS [o]
WHERE [o].[Id] = 1
) AS [t] ON [c].[Id] = [t].[CustomerId]
ORDER BY [c].[Id]

Notice that it no longer orders by the joined Order Ids, only by Customer Ids.

🔂 Loading Related Data

Entity Framework offers several different ways to load the entities that are related to your target entity. For example, when you query for Products, there are different ways that the related Orders will be loaded into the Object State Manager. From a performance standpoint, the biggest question to consider when loading related entities will be whether to use Lazy Loading or Eager Loading.

When using Eager Loading, the related entities are loaded along with your target entity set. You use an Include statement in your query to indicate which related entities you want to bring in.

When using Lazy Loading, your initial query only brings in the target entity set. But whenever you access a navigation property, another query is issued against the store to load the related entity.

The important thing is that you understand the difference between Lazy Loading and Eager Loading so that you can make the correct choice for your application. This will help you evaluate the tradeoff between multiple requests against the database versus a single request that may contain a large payload. It may be appropriate to use eager loading in some parts of your application and lazy loading in other parts.

Suppose you want to query for the customers who live in the UK and their order count.

Using Eager Loading
Using Lazy Loading
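
The two approaches might look like this (entity and property names such as Country are assumptions; lazy loading additionally requires lazy-loading proxies or ILazyLoader to be configured):

```csharp
// Eager loading: one round trip; the orders arrive with the customers.
var eager = context.Customers
    .Include(c => c.Orders)
    .Where(c => c.Country == "UK")
    .ToList();
var eagerCounts = eager.Select(c => new { c.Id, Count = c.Orders.Count });

// Lazy loading: one query for the customers...
var lazy = context.Customers
    .Where(c => c.Country == "UK")
    .ToList();
foreach (var customer in lazy)
{
    // ...and another query is fired here, when Orders is first accessed.
    var count = customer.Orders.Count;
}
```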

When using eager loading, you’ll issue a single query that returns all customers and all orders. When using lazy loading, you’ll first issue one query to get the customers (against the Customers table); then, each time you access the Orders navigation property of a customer, another query (against the Orders table) is issued against the store, so at least two queries in total.

Which one should you select? Well, it depends. Ask yourself these three questions; depending on your answers, you can select either one.

Lazy Loading versus Eager Loading cheat sheet

When you hear performance questions that involve server response time problems, the source of the issue is frequently queries with multiple Include statements. You see, it takes a relatively long time for a query with multiple Include statements in it to go through the internal plan compiler to produce the store command. The majority of this time is spent trying to optimize the resulting query. The generated store command will contain an Outer Join or Union for each Include, depending on your mapping.

This will usually result in database timeout exceptions like:

Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. The statement has been terminated.

If this happens, your query results are usually too large to retrieve before the default timeout expires (the default command timeout in SQL Server is 30 seconds). To overcome this, you can unchain the query into multiple queries with fewer Include statements, or extend the default command timeout period.

Command timeout (in seconds) can be set on the database context like this [doc]:

_context.Database.SetCommandTimeout(400);

or, with Microsoft.Data.SqlClient version 2.1.0 or greater, simply append the command timeout to your connection string like so:

"Data Source=.;Initial Catalog=YourDatabase;Integrated Security=true;Command Timeout=300"

Another option is to unchain the Include statements. Note: this will not make your query faster. It simply gets all the data you need through multiple queries without triggering any timeout exceptions.

Let’s go back to our database and consider the following query.
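
The query chains several Include/ThenInclude calls; a sketch matching the generated SQL (navigation names assumed):

```csharp
var books = _context.Books
    .Include(b => b.Author)
        .ThenInclude(a => a.AuthorBiography)
    .Include(b => b.BookCategories)
        .ThenInclude(bc => bc.Category)
    .ToList();
```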

With thousands of rows, this results in inner joins, unions, or Cartesian products (depending on your query). This query only deals with five entities, but imagine what would happen with ten. The resulting SQL looks like this:

SELECT [b].[Id], [b].[AuthorId], [b].[Title], [a].[Id], [a0].[Id], [t].[BookId], [t].[CategoryId], [t].[Id], [t].[CategoryName], [a].[FirstName], [a].[LastName], [a0].[AuthorId], [a0].[Biography], [a0].[DateOfBirth], [a0].[Nationality], [a0].[PlaceOfBirth]
FROM [Books] AS [b]
INNER JOIN [Authors] AS [a] ON [b].[AuthorId] = [a].[Id]
LEFT JOIN [AuthorBiographies] AS [a0] ON [a].[Id] = [a0].[AuthorId]
LEFT JOIN (
SELECT [b0].[BookId], [b0].[CategoryId], [c].[Id], [c].[CategoryName]
FROM [BookCategories] AS [b0]
INNER JOIN [Categories] AS [c] ON [b0].[CategoryId] = [c].[Id]
) AS [t] ON [b].[Id] = [t].[BookId]
ORDER BY [b].[Id], [a].[Id], [a0].[Id], [t].[BookId], [t].[CategoryId]

Which is OK if timeout does not expire. But if it does, then you can do this:

This will break that large query into digestible pieces.
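
A sketch of the unchained version, which relies on EF Core’s change tracker to fix up the navigation properties across the separate queries (navigation names assumed):

```csharp
// Query 1: books with their categories.
var books = _context.Books
    .Include(b => b.BookCategories)
        .ThenInclude(bc => bc.Category)
    .ToList();

// Query 2: books with their authors; the change tracker attaches the
// authors to the books already loaded above.
_context.Books.Include(b => b.Author).Load();

// Query 3: authors together with their biographies.
_context.Books
    .Include(b => b.Author)
        .ThenInclude(a => a.AuthorBiography)
    .Load();
```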

SELECT [b].[Id], [b].[AuthorId], [b].[Title], [t].[BookId], [t].[CategoryId], [t].[Id], [t].[CategoryName]
FROM [Books] AS [b]
LEFT JOIN (
SELECT [b0].[BookId], [b0].[CategoryId], [c].[Id], [c].[CategoryName]
FROM [BookCategories] AS [b0]
INNER JOIN [Categories] AS [c] ON [b0].[CategoryId] = [c].[Id]
) AS [t] ON [b].[Id] = [t].[BookId]
ORDER BY [b].[Id], [t].[BookId], [t].[CategoryId]

===================================================
SELECT [b].[Id], [b].[AuthorId], [b].[Title], [a].[Id], [a].[FirstName], [a].[LastName]
FROM [Books] AS [b]
INNER JOIN [Authors] AS [a] ON [b].[AuthorId] = [a].[Id]
===================================================
SELECT [b].[Id], [b].[AuthorId], [b].[Title], [a].[Id], [a].[FirstName], [a].[LastName], [a0].[Id], [a0].[AuthorId], [a0].[Biography], [a0].[DateOfBirth], [a0].[Nationality], [a0].[PlaceOfBirth]
FROM [Books] AS [b]
INNER JOIN [Authors] AS [a] ON [b].[AuthorId] = [a].[Id]
LEFT JOIN [AuthorBiographies] AS [a0] ON [a].[Id] = [a0].[AuthorId]
===================================================
SELECT [b].[Id], [b].[AuthorId], [b].[Title]
FROM [Books] AS [b]

This is the benchmark class for this, followed by the results:

Fair enough, we can see there is a slight performance overhead (approximately 19%) with the unchained Include approach, yet it saves the day. Use it only when dealing with large quantities of data and possible timeout exceptions.

🈁 Removing joins where possible

Can there be unnecessary joins? Consider the following query:
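
A sketch of the query (the join comes from projecting through the Author navigation property):

```csharp
var withJoin = context.Books
    .Where(b => b.AuthorId > 5)
    .Select(b => new { b.Title, b.Author.FirstName })
    .ToList();
```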

This query projects the Book’s title and the Author’s first name where the author Id is greater than 5. Of course this is valid code, not unnecessary as such, but there’s another way to do it.
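
One plausible shape of the join-free version, using two round trips (names assumed):

```csharp
// Trip 1: load the matching authors into memory.
var authors = context.Authors
    .Where(a => a.Id > 5)
    .Select(a => new { a.Id, a.FirstName })
    .ToList();
var authorIds = authors.Select(a => a.Id).ToList();

// Trip 2: load the books; Contains translates to an IN (...) clause, no join.
var titles = context.Books
    .Where(b => authorIds.Contains(b.AuthorId))
    .Select(b => new { b.AuthorId, b.Title })
    .ToList();
```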

This one ultimately removes the join but achieves the same result with two queries. If your database has thousands of records, the earlier query can take a long time to execute, again throwing database timeout exceptions. The benchmark results also confirm that it takes less time to retrieve all the records with two queries rather than a single query. Some might be a little skeptical about using Enumerable.Contains here, but its performance has greatly improved since EF 6.0.0 alpha 2.

This is the benchmark class for this, followed by the results:

If you are dealing with similar situations, run a few tests to see which approach suits you. Sacrificing a single trip to the database for two trips, while removing a resource-hungry join operation, can save the day sometimes.

☑️ Few more tips

There are a few other things that you can do to speed up your queries:

  • Reducing columns reduces bandwidth required
  • Reducing rows by introducing paging
  • Turning off change-tracking and identity-management (for example ObjectTrackingEnabled in LINQ-to-SQL) will reduce overheads post-processing the data
  • Using a pre-compiled query can sometimes help reduce preprocessing overheads
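
On that last point, EF Core’s EF.CompileQuery caches the query translation so it is not recompiled on every call; a sketch (context and entity names assumed):

```csharp
// Compiled once, then reused across calls; avoids repeated query compilation.
private static readonly Func<AppDbContext, int, IEnumerable<Book>> BooksByAuthor =
    EF.CompileQuery((AppDbContext context, int authorId) =>
        context.Books.Where(b => b.AuthorId == authorId));

// Usage: var books = BooksByAuthor(dbContext, 5).ToList();
```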

I’m gonna wrap things up here. These are some of the scenarios that I have come across as an engineer, and I hope they will help you too if you run into similar problems.

Read more on LINQ/EF performance:
