-
Type:
Improvement
-
Resolution: Fixed
-
Priority:
Major
-
Component/s: DB
-
None
Source IDs form one component of the primary identifier of every datum collected in SolarNetwork. They often have a lot of information encoded into them via path segments, and are crucial to the day-to-day operations of consumer applications. As the volume of datum collected by SolarNetwork grows, some friction points have become apparent in how source IDs have been architected:
- It is not uncommon for source IDs for already-collected data to change over time, however changing it requires deleting the existing data and re-importing it with the new source ID. This is incredibly slow, error-prone, and computing resource expensive.
- It can also happen that node IDs for already-collected data need to change, for example when a node is replaced but it is desirable to the data from both nodes appear as one.
- Source IDs are relatively long string identifiers, and consume a relatively large amount of space when stored for each datum row in a database. This impacts the speed/efficiency of both storing datum and retrieving them for queries.
A new architectural approach to how node and source IDs are stored has been forming in my mind to address these points. Instead of storing the actual node and source ID values directly, each unique surrogate ID can be assigned by SolarNetwork and that surrogate ID stored with datum. A secondary table that maps those surrogate IDs to actual node and source ID values would provide a way to translate between the two.
The current datum table primary key can be thought to be defined like this:
CREATE TABLE datum ( node_id BIGINT, ts TIMESTAMP, source_id CHARACTER VARYING(64), PRIMARY KEY (node_id, ts, source_id) )
A new approach using UUID surrogate IDs would look like this:
CREATE TABLE datum ( stream_id UUID, ts TIMESTAMP, PRIMARY KEY (stream_id, ts) ); CREATE TABLE datum_stream_map ( node_id BIGINT, source_id CHARACTER VARYING(64), stream_id UUID PRIMARY KEY (stream_id), UNIQUE (node_id, source_id) );
Here the datum_stream_map table provides the mapping between node + source ID pairs and stream ID values.
This this new approach, changing a node or source ID would effectively be free: just a change to a row in the datum_stream_map table. The datum table would have a fixed-size primary key no matter how big the source ID values became (the limit to 64 characters could easily be removed or increased even).
The only situation that would still require deleting existing datum and re-importing it would be when the node or source ID needs to change for only a subset of the existing datum for a given stream ID needs to change. Then that subset would need to have its node_id or source_id value updated. A specific API could be exposed to perform that task, however, with the understanding that replacing a fixed-size stream ID value might more efficiently be handled by the database than the variable-sized string value could. If that is not true or feasible, then because this scenario is really an edge case it might be acceptable to still require deleting/importing the data to achieve the desired outcome