C++ serialization of complex data using Boost


I have a set of classes whose data I want to serialize. There is a lot of it, though: we're talking about a std::map holding a million or more class instances.

Not wishing to optimize my code too early, I thought I'd try a simple and clean XML implementation first, so I used TinyXML to save the data out to XML, but it was far too slow. So I've started looking at using Boost.Serialization to write and read standard ASCII or binary.

It seems to be much better suited to the task as I don't have to allocate all this memory as an overhead before I get started.


My question is essentially how to go about planning an optimal serialization strategy for a file format. I don't particularly want to serialize the whole map if it's not necessary, as it's really only the contents I'm after. Having played around with serialization a little (and looked at the output), I don't understand how the code that loads the data back in could know when it has reached the end of the map, for example, if I simply save out all the items one after another. What issues do you need to consider when planning a serialization strategy?

Thanks.

5 answers

#1


Boost.Serialization has many advantages. For instance, as you say, just adding a method with a specified signature to a class allows the framework to serialize and deserialize your data. Boost.Serialization also includes serializers and readers for the standard STL containers, so you don't have to worry about whether all keys have been stored (they will be) or how to detect the last entry in the map when deserializing (it is detected automatically).
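To make the "end of the map" point concrete, here is a minimal sketch of one common way a loader can know where a container stops: write the element count first, then the elements. It uses only the standard library and is not Boost's actual archive format. The `Item` type, the function names, and the text layout are all assumptions made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical value type standing in for the real class.
struct Item {
    int quantity;
    double weight;
};

// Save: element count first, then one "key quantity weight" line per entry.
void saveItems(std::ostream& out, const std::map<int, Item>& items) {
    out << items.size() << '\n';
    for (const auto& entry : items) {
        out << entry.first << ' ' << entry.second.quantity << ' '
            << entry.second.weight << '\n';
    }
}

// Load: read the count, then exactly that many entries, and stop.
// Whatever follows the map in the stream is left untouched.
bool loadItems(std::istream& in, std::map<int, Item>& items) {
    std::size_t count = 0;
    if (!(in >> count)) return false;
    items.clear();
    for (std::size_t i = 0; i < count; ++i) {
        int key = 0;
        Item item{};
        if (!(in >> key >> item.quantity >> item.weight)) return false;
        // Entries were written in key order, so hinting at end() is cheap.
        items.emplace_hint(items.end(), key, item);
    }
    return true;
}

// Small round trip: the loader stops after the map and leaves the trailer.
void demoCountPrefix() {
    std::map<int, Item> original;
    original[1] = Item{10, 2.5};
    original[2] = Item{20, 0.25};

    std::stringstream stream;
    saveItems(stream, original);
    stream << "trailer";  // unrelated data stored after the map

    std::map<int, Item> restored;
    assert(loadItems(stream, restored));
    assert(restored.size() == 2);
    assert(restored[2].quantity == 20);

    std::string next;
    stream >> next;
    assert(next == "trailer");
}
```

Calling `demoCountPrefix()` from `main` exercises the round trip. A truncated stream makes `loadItems` return false instead of silently producing a short map.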

There are, however, some things to consider. For example, if your class has a field that is calculated from other data or exists only to speed things up, such as an index or a hash table, you don't have to store it, but you do have to reconstruct it from the data read back from disk.
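As a rough illustration of that trade-off, the sketch below stores only the primary records and rebuilds a lookup index after loading. `Catalog`, `Record`, and the whitespace-separated text layout are hypothetical names and choices for this example. It also assumes names contain no whitespace.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

struct Record {
    int id;
    std::string name;  // assumed to contain no whitespace in this sketch
};

class Catalog {
public:
    void add(int id, const std::string& name) {
        records_.push_back(Record{id, name});
        indexById_[id] = records_.size() - 1;
    }

    // Lookup goes through the derived index, which is never written to disk.
    const Record* find(int id) const {
        auto it = indexById_.find(id);
        return it == indexById_.end() ? nullptr : &records_[it->second];
    }

    // Only the primary data is saved.
    void save(std::ostream& out) const {
        out << records_.size() << '\n';
        for (const Record& r : records_) out << r.id << ' ' << r.name << '\n';
    }

    // After reading the primary data, the index is rebuilt from it.
    bool load(std::istream& in) {
        std::size_t count = 0;
        if (!(in >> count)) return false;
        std::vector<Record> loaded;
        loaded.reserve(count);
        for (std::size_t i = 0; i < count; ++i) {
            Record r{};
            if (!(in >> r.id >> r.name)) return false;
            loaded.push_back(r);
        }
        records_.swap(loaded);
        rebuildIndex();
        return true;
    }

private:
    void rebuildIndex() {
        indexById_.clear();
        indexById_.reserve(records_.size());
        for (std::size_t i = 0; i < records_.size(); ++i) {
            indexById_[records_[i].id] = i;
        }
    }

    std::vector<Record> records_;
    std::unordered_map<int, std::size_t> indexById_;
};

void demoRebuildIndex() {
    Catalog original;
    original.add(7, "bolt");
    original.add(3, "nut");

    std::stringstream stream;
    original.save(stream);

    Catalog restored;
    assert(restored.load(stream));
    assert(restored.find(3) != nullptr);
    assert(restored.find(3)->name == "nut");
    assert(restored.find(99) == nullptr);
}
```

The cost is one extra pass over the records at load time. In exchange, the file stays smaller and the index layout can change without invalidating old files.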

As for the "file format" you mention, I think we sometimes focus on the format rather than on the data. I mean, the exact format of the file doesn't matter as long as you can retrieve the data seamlessly using (say) Boost.Serialization. If you want to share the file with other utilities that don't use the same serialization library, that's another matter. But just for the purposes of (de)serialization, you don't have to care about the internal file format.

#2


Read this FAQ! Does that help to get started?


#3


I don't particularly want to serialize the whole map if it's not necessary, as it's really only the contents I'm after.


Does that mean you don't really need to serialize the whole object? Maybe you should reconsider just using a text-based format. If you really need to serialize only a subset of the key/value pairs in a map then you should probably just write them to a text file and read them in later. You don't necessarily need XML; just one line per map key followed by one line with the value should work.

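A minimal sketch of that line-based layout, assuming string keys and values that never contain a newline (the function names are made up for this example). The caller decides which subset of pairs to pass in, and the end of the data is simply the end of the stream.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// One line for the key, followed by one line for the value.
void writePairs(std::ostream& out,
                const std::map<std::string, std::string>& pairs) {
    for (const auto& entry : pairs) {
        out << entry.first << '\n' << entry.second << '\n';
    }
}

// Reads key/value line pairs until the stream runs out.
std::map<std::string, std::string> readPairs(std::istream& in) {
    std::map<std::string, std::string> pairs;
    std::string key;
    std::string value;
    while (std::getline(in, key) && std::getline(in, value)) {
        pairs[key] = value;
    }
    return pairs;
}

void demoLinePairs() {
    std::map<std::string, std::string> original;
    original["colour"] = "dark red";  // spaces are fine, newlines are not
    original["size"] = "42 mm";

    std::stringstream stream;
    writePairs(stream, original);

    assert(readPairs(stream) == original);
}
```

Values with embedded newlines would need escaping or a length prefix. At that point an existing format is probably the better choice.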

#4


If all you want is key/value pairs, then the important thing is the types of the keys and values; these will colour how you deal with things.

Serialising the map itself would be a poor plan in general since you may wish to change your associative container type later but not invalidate (or have to translate) previous serialised files.


Serialising the container itself can be useful in certain circumstances, if you wish to avoid the cost of rebuilding it on load (though pre-sizing the container is normally sufficient to avoid the vast majority of this overhead). This should be a decision based on specific aspects of your application and usage.
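One way to read the two points above together is to store only a count and the key/value pairs, so the file does not care which associative container holds them, and to pre-size the container on load where the container supports it. The sketch below is an illustration under those assumptions; `savePairs`, `loadPairs`, and `presize` are hypothetical helpers, and keys and values are assumed to stream with `<<` and `>>`.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <unordered_map>
#include <utility>

// Pre-size hash maps; std::map is node-based, so there is nothing to reserve.
template <typename K, typename V>
void presize(std::unordered_map<K, V>& c, std::size_t n) { c.reserve(n); }
template <typename K, typename V>
void presize(std::map<K, V>&, std::size_t) {}

// The file holds a count and the pairs, nothing about the container type.
template <typename Container>
void savePairs(std::ostream& out, const Container& c) {
    out << c.size() << '\n';
    for (const auto& kv : c) out << kv.first << ' ' << kv.second << '\n';
}

template <typename Container>
bool loadPairs(std::istream& in, Container& c) {
    std::size_t count = 0;
    if (!(in >> count)) return false;
    c.clear();
    presize(c, count);  // avoids repeated rehashing while inserting
    for (std::size_t i = 0; i < count; ++i) {
        typename Container::key_type key{};
        typename Container::mapped_type value{};
        if (!(in >> key >> value)) return false;
        c.emplace(std::move(key), std::move(value));
    }
    return true;
}

// Written from a std::map, read back into a std::unordered_map.
void demoContainerSwap() {
    std::map<int, double> ordered;
    ordered[1] = 1.5;
    ordered[2] = 2.5;

    std::stringstream stream;
    savePairs(stream, ordered);

    std::unordered_map<int, double> hashed;
    assert(loadPairs(stream, hashed));
    assert(hashed.size() == 2);
    assert(hashed.at(2) == 2.5);
}
```

Files written this way stay readable if the program later switches from `std::map` to `std::unordered_map` or back.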

If you supply the types of the keys and values, we can help more. Without that, here are some general tips:

  • If they are amenable to string representation, then a simple CSV file may be sufficient (but use an existing reader/writer library for it; reading and writing legitimate CSV is harder than it looks)

  • If they are fixed width, then a simple binary format will make reading and writing very easy (and quick), but care should be taken to acknowledge the following issues:
    • endianness
    • whether you wish to allow simple catting of such files together or add CRC-like values for integrity (you can do both, but it's harder)
    • you lose the ability to grep the files (this is a real loss; you may end up having to reinvent parts of your toolchain for this)
    • whether changing platform/compiler/size_t will break the format

  • Some structured textual format that is lighter than XML. There are several (JSON, YAML, etc.). These provide extensibility that you quite likely don't require.
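For the fixed-width binary option in the list above, the endianness and size_t concerns can be sidestepped by choosing an explicit integer width and byte order for the file rather than dumping memory as-is. The sketch below picks 32-bit little-endian as an arbitrary assumption; `writeU32LE` and `readU32LE` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Always emit the least significant byte first, whatever the host CPU does.
void writeU32LE(std::ostream& out, std::uint32_t value) {
    char bytes[4];
    for (int i = 0; i < 4; ++i) {
        bytes[i] = static_cast<char>((value >> (8 * i)) & 0xFFu);
    }
    out.write(bytes, 4);
}

// Reassemble the value from four bytes; returns false on a short read.
bool readU32LE(std::istream& in, std::uint32_t& value) {
    char bytes[4];
    if (!in.read(bytes, 4)) return false;
    value = 0;
    for (int i = 0; i < 4; ++i) {
        value |= static_cast<std::uint32_t>(
                     static_cast<unsigned char>(bytes[i])) << (8 * i);
    }
    return true;
}

void demoFixedWidth() {
    std::stringstream stream(std::ios::in | std::ios::out | std::ios::binary);
    writeU32LE(stream, 0x12345678u);

    // The on-disk layout is fixed: 78 56 34 12.
    const std::string raw = stream.str();
    assert(raw.size() == 4);
    assert(static_cast<unsigned char>(raw[0]) == 0x78);
    assert(static_cast<unsigned char>(raw[3]) == 0x12);

    std::uint32_t restored = 0;
    assert(readU32LE(stream, restored));
    assert(restored == 0x12345678u);
}
```

When writing to real files, open them with `std::ios::binary` so that no newline translation alters the bytes.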

#5


Use Google's Protocol Buffers, a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.

There are bindings for C++, Java, Python, Perl, C#, and Ruby.


You describe your data in .proto metadata files:

// required/optional are proto2 keywords, so state the syntax explicitly.
syntax = "proto2";

message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}

Then you would write a message from C++ like this:

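#include <fstream>
#include "person.pb.h"  // generated by protoc; assumes the schema file is named person.proto

int main() {
  Person person;
  person.set_id(123);
  person.set_name("Bob");
  person.set_email("bob@example.com");

  // Write the message in binary form, truncating any existing file.
  std::fstream out("person.pb", std::ios::out | std::ios::binary | std::ios::trunc);
  person.SerializeToOstream(&out);
  out.close();
  return 0;
}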

Or read one back like this:

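#include <fstream>
#include <iostream>
#include "person.pb.h"  // generated by protoc; assumes the schema file is named person.proto

int main() {
  Person person;
  // Parse the binary message written earlier.
  std::fstream in("person.pb", std::ios::in | std::ios::binary);
  if (!person.ParseFromIstream(&in)) {
    std::cerr << "Failed to parse person.pb." << std::endl;
    return 1;
  }

  std::cout << "ID: " << person.id() << std::endl;
  std::cout << "name: " << person.name() << std::endl;
  // email is optional, so check for its presence before printing it.
  if (person.has_email()) {
    std::cout << "e-mail: " << person.email() << std::endl;
  }
  return 0;
}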

For a more complete example, see the tutorials.


