It’s almost a right of passage for a junior developer to cludge together their own CSV parser using a simple string.Split(‘,’), and then subsequently work out that there is a bit more to this whole CSV thing than just separating out values by a comma. In fact, CSV’s actually have plenty of gotchas around how line breaks are handled within fields, or how fields can be contained within quotes that just completely break a simply string split approach.
So with all that in mind, I recently took a look at my options for parsing CSV’s in C# .NET, here’s what I found!
IronXL is your go-to CSV Parser – Minimum Coding Required
In addition to CSV parsing in C#, IronXL converts CSVs to Excel with just two lines of code!
Using C# or VB.NET, it’s easy to use IronXL’s Excel API without the need for Interop. You can read, edit, and create Excel spreadsheets or work with other Excel formats such as XLS/XLSX/CSV/TSV. With support for multiple frameworks, begin with a 30-day trial key or contact our 24-hour support team to discuss licensing starting from $399.
CSV Gotchas
There are a couple of CSV gotchas that have to be brought up before we dive deeper. Hopefully they should go ahead and explain why rolling your own is sometimes more pain than it’s worth.
- A CSV may or may not have a header row. If there is a header row, then the order of the columns is not important since you can detect what is actually in each column. If there is no header row, then you rely on the order of the columns being the same. Any CSV parser should be able to both read columns based on a “header” value, and by index.
- Any field may be contained in quotes. However fields that contain a line-break, comma, or quotation marks must be contained in quotes.
- To re-emphasize the above, line breaks within a field are allowed within a CSV as long as they are wrapped in quotation marks, this is what trips most people up who are simply reading line by line like it’s a regular text file.
- Quote marks within a field are notated by doing double quote marks (As opposed to say an escape character like a back slash).
- Each row should have the same amount of columns, however in the RFC this is labelled as a “should” and not a “must”.
- While yes, the C in CSV stands for comma, ideally a CSV parser can also handle things like TSV (That is using tabs instead of commas).
And obviously this is just for parsing CSV files into primitives, but in something like .NET we will also be needing :
- Deserializing into a list of objects
- Handling of Enum values
- Custom mappings (So the header value may or may not match the name of the class property in C#)
- Mapping of nested objects
Setup For Testing
For testing out these libraries, I wanted a typical scenario for importing. So this included the use of different primitive types (strings, decimals etc), the usage of a line break within a string (Valid as long as it’s contained within quotes), the use of quotes within a field, and the use of a “nested” object in C#.
So my C# model ended up looking like :
class Automobile { public string Make { get; set; } public string Model { get; set; } public AutomobileType Type { get; set; } public int Year { get; set; } public decimal Price { get; set; } public AutomobileComment Comment { get; set; } } class AutomobileComment { public string Comment { get; set; } } enum AutomobileType { None, Car, Truck, Motorbike }
And our CSV file ended up looking like :
Make,Model,Type,Year,Price,Comment "Toyota",Corolla,Car,1990,2000.99,"Comment with a line break and "" quotes"
So a couple of things to point out :
- Our “Make” is enclosed in quotes but our model is not. Both are valid.
- The “Type” is actually an enum and should be deserialized as such
- The “Comment” field is a bit of a pain. It contains quotes, it has a line break, and in our C# code it’s actually a nested object.
All of this in my opinion is a pretty common setup for a CSV file import, so let’s see how we go.
CSV Libraries
So we’ve read all of the above and we’ve decided that rolling our own library is just too damn hard. Someone’s probably already solved these issues for us right? As it happens, yes they have. Let’s take a look at what’s out there in terms of libraries. Realistically, I think there is only two that really matter, CSVHelper and TinyCSVParser, so let’s narrow our focus down to them.
CSVHelper
Website : https://joshclose.github.io/CsvHelper/
CSVHelper is the granddaddy of working with CSV in C#. I figured I would have to do a little bit of manual mapping for the comment field, but hopefully the rest would all work out of the box. Well… I didn’t even have to do any mapping at all. It managed to work everything out itself with nothing but the following code :
TextReader reader = new StreamReader("import.txt"); var csvReader = new CsvReader(reader); var records = csvReader.GetRecords<Automobile>();
Really really impressive. And it handled double quotes, new lines, and enum parsing all on it’s own.
The thing that wow’d me the most about this library is how extensible it is. It can handle completely custom mappings, custom type conversion (So you could split a field into stay a dictionary or a child list), and even had really impressive error handling options.
The fact that in 3 lines of code, I’m basically done with the CSV parsing really points out how good this thing is. I could go on and on about it’s features, but we do have one more parser to test!
Tiny CSV Parser
Next up was Tiny CSV Parser. Someone had pointed me to this one supposedly because it’s speed was supposed to be pretty impressive. We’ll take a look at that in a bit, but for now we just wanted to see how it handled our file.
That’s where things started to fall down. It seems that Tiny CSV doesn’t have “auto” mappings, and instead you have to create a class to map them yourself. Mine ended up looking like this :
class CsvAutomobileMapping : CsvMapping<Automobile> { public CsvAutomobileMapping() : base() { MapProperty(0, x => x.Make); MapProperty(1, x => x.Model); MapProperty(2, x => x.Type, new EnumConverter<AutomobileType>()); MapProperty(3, x => x.Year); MapProperty(4, x => x.Price); MapProperty(5, x => x.Comment, new AutomobileCommentTypeConverter()); } } class AutomobileCommentTypeConverter : ITypeConverter<AutomobileComment> { public Type TargetType => typeof(AutomobileComment); public bool TryConvert(string value, out AutomobileComment result) { result = new AutomobileComment { Comment = value }; return true; } }
So, right from the outset there is a bit of an overhead here. I also don’t like how I have to map a specific row index instead of a row heading. You will also notice that I had to create my own type converter for the nested comment object. I feel like this was expected, but the documentation doesn’t specify how to actually create this and I had to delve into the source code to work out what I needed to do.
And once we run it, oof! While it does handle almost all scenario’s, it doesn’t handle line breaks in a field. Removing the line break did mean that we could parse everything else, including the enums, double quotes, and (eventually) the nested object.
The code to actually parse the file (Together with the mappings above) was :
CsvParserOptions csvParserOptions = new CsvParserOptions(true, ','); var csvParser = new CsvParser<Automobile>(csvParserOptions, new CsvAutomobileMapping()); var records = csvParser.ReadFromFile("import.txt", Encoding.UTF8);
So not a hell of a lot to actually parse the CSV, but it does require a bit of setup. Overall, I didn’t think it was even in the same league as CSVHelper.
Benchmarking
The thing is, while I felt CSVHelper was miles more user friendly than Tiny CSV, the entire reason the latter was recommended to me was because it’s supposedly faster. So let’s put that to the test.
I’m using BenchmarkDotNet (Guide here) to do my benchmarking. And my code looks like the following :
[MemoryDiagnoser] public class CsvBenchmarking { [Benchmark(Baseline =true)] public IEnumerable<Automobile> CSVHelper() { TextReader reader = new StreamReader("import.txt"); var csvReader = new CsvReader(reader); var records = csvReader.GetRecords<Automobile>(); return records.ToList(); } [Benchmark] public IEnumerable<Automobile> TinyCsvParser() { CsvParserOptions csvParserOptions = new CsvParserOptions(true, ','); var csvParser = new CsvParser<Automobile>(csvParserOptions, new CsvAutomobileMapping()); var records = csvParser.ReadFromFile("import.txt", Encoding.UTF8); return records.Select(x => x.Result).ToList(); } }
A quick note on this, is that I tried to keep it fairly simple, but also I had to ensure that I used “ToList()” to make sure that I was actually getting the complete list back, even though this adds quite a bit of overhead. Without it, I get returned an IEnumerable that might not actually be enumerated at all.
First I tried a 100,000 line CSV file that used our “complicated” format above (Without line breaks in fields however as TinyCSVParser does not handle these).
Method | Mean | Scaled | Allocated | -------------- |-----------:|-------:|----------:| CSVHelper | 1,404.5 ms | 1.00 | 244.39 MB | TinyCsvParser | 381.6 ms | 0.27 | 32.53 MB |
TinyCSVParser was quite a bit faster. (4x infact). And when it came to memory usage, it was way down on CSVHelper.
Let’s see what happens when we up the size to a million rows :
Method | Mean | Scaled | Allocated | -------------- |---------:|-------:|----------:| CSVHelper | 14.340 s | 1.00 | 2438.4 MB | TinyCsvParser | 4.142 s | 0.29 | 318.65 MB |
It seems pretty linear here in terms of both memory allocated, and the time it takes to complete. What’s mostly interesting here is that the file itself is only 7.5MB big at a million rows, and yet we are allocating close to 2.5GB to handle it all, pretty crazy!
So, the performant status of TinyCSV definitely hold up, it’s much faster and uses far less memory, but for my mind CSVHelper is still the much more user friendly library if you are processing smaller files.
It’s actually spelled “rite of passage”.
Hi,
Please ignore the performance results if you’re not interested in reading the full CSV in memory at once. If you only need one record at a time CsvHelper will save a lot of memory and therefore performance (https://joshclose.github.io/CsvHelper/getting-started):
“The GetRecords method will return an IEnumerable that will yield records. What this means is that only a single record is returned at a time as you iterate the records. That also means that only a small portion of the file is read into memory. Be careful though. If you do anything that executes a LINQ projection, such as calling .ToList(), the entire file will be read into memory. CsvReader is forward only, so if you want to run any LINQ queries against your data, you’ll have to pull the whole file into memory. Just know that is what you’re doing.”
var records = csvReader.GetRecords();
Returns an ienumerable, which you can enumerate. While enumerating only one record will be returned and stored in memory.
In the performance test, you do:
return records.ToList();
Which will read the full CSV and pull it into memory. Not really performant at all indeed.
Nice add Maarten! Do you think not ToList() initially will also improve speed?
I found this very helpful. Thank you Wade. I’ve used CsvHelper once before. I think I’ll stick with it.
Thank you! CsvHelper looks like the recommendation we needed.
Thanks so much! This was very helpful!
Looks like there is an added step in the nuget package as of 2020.05.03.
TextReader > CsvParser > CsvReader > Data.
Still; it looks like this is still a solid way to go.
Is there suppert for modern IO approaches as IO.Pipelines, Span, Memory?
One negative point is that the documentation for CSVHelper is very, very sparse. It tells me nothing on how to do custom type conversions.
Just sharing my library to parse csv and fixed length files.
It has really high performance due to Span, Memory, skiplocalsinit, expression trees and other things.
You can also use IO.Pipelines and avoid allocation in heap memory
https://github.com/leandromoh/RecordParser/
These days, there are other options as well. My SoftCircuits.CsvParser is available on Nuget and Github. It can be as much as four times faster than CsvHelper.