Solving Node RAM Consumption with a RAID 100-like Setup and 2 PERC Controllers

in #witness-category · 7 years ago (edited)

The exponential growth of data and memory consumption is becoming a problem, or at least it will be in the near future. Most industry-grade servers, such as the Dell PowerEdge machines I'm using, support a maximum of 240GB of RAM. With about 500MB of data growth daily, we can expect that an industry-grade server will very soon not be enough for running a witness / seed node with enough plugins to support direct wallet operation and price feed publishing.


My witness node has the following set of plugins and public APIs enabled:

Public API: database_api login_api account_by_key_api network_broadcast_api
Plugins: witness account_by_key
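
In config.ini terms, that corresponds to something like the following (a sketch; these should be the pre-AppBase option names, so check them against your own config):

# config.ini excerpt (sketch)
public-api = database_api login_api account_by_key_api network_broadcast_api
enable-plugin = witness account_by_key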

Current memory consumption is: 54G

Another problem with non-persistent memory is the need to re-sync the database in case of failure. RAM is efficient and fast; still, I need to periodically stop my node in order to back up shared_memory.bin from /dev/shm, so I don't end up with a full resync in case of a power outage.
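
For reference, the backup routine boils down to something like this. A minimal sketch under my assumptions: the backup destination is an example path, and steemd gets started again afterwards by whatever startup script you normally use.

# stop steemd gracefully so shared_memory.bin is left in a consistent state
kill -INT "$(pidof steemd)"
while pidof steemd > /dev/null; do sleep 1; done

# copy the state file out of tmpfs (destination is an example path)
cp /dev/shm/shared_memory.bin /backup/shared_memory.bin.$(date +%F)

# then start steemd again via your usual startup script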

The Approach:

  • 2 × PERC H740P controllers, each handling 8 × 1TB EVO SSDs.
  • RAID 1 on the lower layer for redundancy and persistence.
  • RAID 0 on top, as a software RAID bridging the two controllers (a sketch of this layer follows below).
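
The software layer is plain mdadm. A minimal sketch, assuming each PERC exposes its mirrored array as a single virtual disk and that those show up as /dev/sdb and /dev/sdc (device names and mount point will differ on your system):

# stripe (RAID 0) across the two hardware RAID 1 virtual disks
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc

# format and mount it where steemd will keep shared_memory.bin
mkfs.ext4 /dev/md0
mount /dev/md0 /steem-data

# persist the array definition across reboots
mdadm --detail --scan >> /etc/mdadm/mdadm.conf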

Here is how it looks:


[Image: Raid.png — diagram of the RAID 100-like topology]



Here is the IOWait of my witness node (in production):


[Image: iowait.png — IOWait graph of the witness node]


I think I can totally live with this level. 1TB of persistent storage will definitely serve for a long while. Moreover: no more re-syncing, and easier backups. The node is up; we'll see how it performs and whether there are any missed blocks. So far, to my surprise, everything works as expected with consumer-grade SSDs that are not even SAS, just regular SATA, with which the PERC is backwards compatible.


This is one approach to addressing the increased amount of data. The next one I will try to implement is an HAProxy-like split of traffic between multiple nodes based on block number. As this would involve Layer 7 inspection and re-encryption, I don't expect miracles here, but it's worth trying, as we are going to need load balancing sooner or later.

There was a good proposal by @gandalf (@gtg) to load balance based on different plugins loaded on different nodes. This could be an interesting approach as well, though I'd go further and NAT the packets based on transaction type.

What's important here is that I managed to achieve much better results with an ESXi virtualized host than with raw Ubuntu on the server. It seems the drivers are messy, or there may be some other kernel tweak that I am missing. However, given that it runs well on the ESXi hypervisor, I am sure we can achieve even better performance with kernel tweaks.


If you would like to support me, you can do that via the Steemit website:



[Image: votegif.gif — voting via the Steemit website]

If you are an advanced user, you can do it with an unlocked cli_wallet by executing:

vote_for_witness "yourusername" "crt" true true

Thanks to all supporters and all the members of the Steemit community.

Comments:

Seems overkill when two good NVMe drives in RAID 0 would likely still be faster than this setup and wouldn't even require a fancy RAID controller. The problem is that getting high-speed NVMe in small (1U-2U) servers for efficient hosting is hard, as most servers have slow NVMe at 300-500% of the cost.

You can get 2TB NVMe drives now with over 3500MB/s for around $1,200 each. Two of these in RAID 0 would give you plenty of storage for long into the future and plenty of speed for offloading the RAM demands to disk.

It would also likely allow you to run two full nodes for the cost of one of your setups. I would generally never recommend RAID 0, but given how full nodes work, if you have a crash you are rebuilding anyway.

I would even bet a single NVMe Pro drive without RAID would still be faster than this setup.

I like @gtg's idea of using jussi to distribute transactions by type to multiple servers, scaling horizontally instead of vertically.

I would not mix a witness and a full node. I assume you know that once you start thinking about enabling other plugins you should move your witness elsewhere, but I wanted to be clear about that point.

The problem with NVMe RAID 0 is that it's only achievable with software RAID. Meaning: no battery backing, no controller cache. Also, RAID 1 at the bottom, combined with the battery, is important to preserve the data. Regarding crashes: in case of a power outage, scripts read the UPS status via SNMP and issue SIGINT to steemd, so that the data is safely put on disk. I also made a couple of edits here and there in the code to ensure corruption-free files even in case of a steemd crash (not properly tested yet, but I will submit a merge request on GitHub if it works well on my server).
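
(The UPS part is conceptually this simple. A sketch only: the address and community string are examples, and the OID is upsOutputSource from the generic UPS-MIB (RFC 1628); your UPS may need a vendor-specific MIB instead.)

#!/bin/bash
# Poll the UPS over SNMP; when output switches to battery, stop steemd cleanly.
UPS_HOST="10.0.0.5"              # example UPS management address
OID="1.3.6.1.2.1.33.1.4.1.0"     # upsOutputSource (UPS-MIB, RFC 1628)

while true; do
    SRC=$(snmpget -v2c -c public -Ovq "$UPS_HOST" "$OID")
    if [ "$SRC" = "5" ]; then    # 5 = battery
        # SIGINT lets steemd flush its state to disk before exiting
        kill -INT "$(pidof steemd)"
        break
    fi
    sleep 10
done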

While theoretically you are right in terms of speed, I still prefer the industry-grade approach. It's much easier to scale. I also have a hot spare, dual power supplies, and ECC memory. These things are important. (Learned the hard way working with mission-critical systems.)

It could be a "psychological" problem of mine :) but I don't trust a system unless I can replace a faulty drive or power supply without the guest OS even being aware that anything happened at all.

As for the seed nodes, I did manage to perform P2P load balancing based on block number using a CCR1036-12G-4S-EM and some scripting. (Not in use yet, as it needs to be properly tested.)

The way I see witness work one year from now, only "by the book" industry-grade systems with a well-planned topology will survive and be able to scale. While it's possible to improvise, it comes at a price; many of us (including myself) learn the hard way.

Another problem is that people experienced with mission-critical systems outside the crypto industry are hard to bring into projects such as this one, while people from crypto tend to experiment with DIY solutions. That's fun, but without the first category we miss the knowledge transfer that should take place to ensure that not every single lesson has to be paid for again.

I'll try to implement as much as possible from industry best practices in this project (and others I'm involved in) in order to better understand the technology challenges.

While we are experimenting with DIY solutions, IBM (and others) are watching what we are doing, with their proprietary blockchain solutions at the ready; once we prove a good working model, they will be able to deploy a Steem-like platform in a matter of seconds, one that is more stable and scalable in every possible way. That's what I'm afraid of: that current technologies are just a testing ground for corporate players.

(These are my predictions, and articles such as this one are my contribution for the future, should my view turn out to be correct; or it may not happen at all.)

The problem is that the hardware requirements for a full node on Steem are so ridiculous that you have to consider DIY, especially since there is no data loss in a complete failure scenario.

Enterprise-quality hardware for a full node would be $30-40k. A Dell server that meets the requirements today is around $20k, and that's without any redundancy or hardware RAID. The fact that the hardware would be outgrown in 6 months makes it even worse.

There is a good chance all these servers we are setting up for full nodes with 512GB of RAM won't be sufficient in 6 months. AppBase promises to help in this regard, as does account history changing to 30 days instead of going back to block 0 on full nodes, but then again, the Steem Mobile Wallet was coming out in November.

Spending $20k-$30k+ on hardware that may last 6 months isn't a very smart move, especially when there is no financial return from running a full node, only from a witness node, and the positions with the funding to do such a thing are pretty much locked in stone.

Just keep in mind that for the price of two NVMe drives, you can get a refurbished enterprise-grade server with a controller and even 240GB of ECC RAM on eBay. (The cost of a new one would be $20-30k.)

But you are right on many points. Unless there are other motives, investing in enterprise-grade infrastructure is most likely not a wise choice, and it's unlikely to generate ROI.

For example, ECC RAM is a crucial thing, so you don't need to rebuild from scratch even in case of a crash. Not sure how many people are aware of that. (I'm not talking about RAM as storage, just about the operation of steemd.) When ECC RAM is used, file corruption is unlikely even when kill -9 is used.

On another note, I am a bit fascinated that these things are not documented, and I got the impression that there is a strong degree of "selfishness" when it comes to knowledge exchange, which is the total opposite of the open-source ideology that blockchain evolved from.

Wow, nice storage :-)

but...

For your steemd configuration this setup is irrelevant. Those plugins don't use much memory anyway. Also:

Current memory consumption is: 54G

as mentioned in chat, that's because you have the wrong build.

For this config you should use -DLOW_MEMORY_NODE.
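
(For reference, that's a cmake option at build time; a typical out-of-source build would look roughly like this:)

cd steem/build
cmake -DCMAKE_BUILD_TYPE=Release -DLOW_MEMORY_NODE=ON ..
make -j$(nproc) steemd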

But of course, for a full node, that beast can be very effective.

However, please note that your results are currently heavily biased by memory caching.

Would you mind running:
dd if=/dev/zero of=tst.tmp bs=4k count=10k oflag=dsync
out there?

(The general approach is good; I'm using fast storage on my nodes to compensate for the lack of RAM.)

Absolutely,

root@steem:/home# dd if=/dev/zero of=tst.tmp bs=4k count=10k oflag=dsync
10240+0 records in
10240+0 records out
41943040 bytes (42 MB, 40 MiB) copied, 0.0205423 s, 2.0 GB/s

(Keep in mind that the controllers themselves carry 8GB of DDR4 RAM each, so the results could differ from what's actually happening on the first-level array, where it's probably much slower. But I guess that's what you meant by memory-cached results.)

As for the -DLOW_MEMORY_NODE flag, I am trying to avoid it, as I plan to enable as many plugins as possible, eventually going with a full node (all the available plugins).

I guess we need to see what is going to happen then.

That's what I'm afraid of. I was running a full node on HDDs with a decent RAID, and everything worked fine as long as the cache (at various levels) could keep up with the load. But of course, with the much, much lower latency of SSDs, it might be a viable solution. (HDDs are useful because you can spot latency issues much earlier.)

Agreed in full. Still, I find such experiments crucial for Steemit's scalability, and I'm very glad to see more people thinking about it, as well as having an "organ" that is supposed to perform the thinking ;)

The results above could explain the ESXi / bare-metal differences, as it's likely that the Ubuntu drivers are not handling the cache correctly.

It would be interesting to see if there is any difference once I replace the SSDs with mechanical SAS drives. (I probably won't manage it before the end of the week, but I will post an update.) That would give full insight into the SSD/HDD latency difference once the cache can't keep up with the data.

As for the memory consumption, I think df -h is totally irrelevant here, as "HotPlug Memory" and "Reserve All" are configured in vCenter. This is the vCenter output of the node's memory usage (1.6 GB only). vCenter can be very mysterious with its stats sometimes.


[Image: Screen Shot 2018-02-18 at 11.06.36 AM.png — vCenter memory usage view]



While I do think better SSD drives could offer a good alternative for scaling Steemit, I still put my hopes in Stunnel -> HAProxy -> Stunnel-style clustering (if only for the purpose of understanding). Following your idea of nodes with different plugins loaded, and examining tcpdumps, I found it possible to redirect traffic based on transaction type.
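
To illustrate the direction, here's a rough HAProxy sketch. Everything in it is an example: the backend addresses are made up, and matching on a substring of the request body is far too crude for real JSON-RPC routing, but it shows the Layer 7 idea:

frontend rpc_in
    mode http
    bind *:8090
    option http-buffer-request          # buffer the request body for inspection
    acl is_history req.body -m sub get_account_history
    use_backend history_nodes if is_history
    default_backend general_nodes

backend history_nodes
    mode http
    server node1 10.0.0.11:8090 check   # example: node with the history plugin

backend general_nodes
    mode http
    server node2 10.0.0.12:8090 check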

Anyhow, experimentation and communication like this is a very good signal for Steemit's future scalability and, therefore, its sustainability.

Nice having you around.

Disclaimer: I don't hide my real motives; the experiments performed here are done both for the benefit of the community and to expand my knowledge in big data, as I do full-time consultancy for a living.

Sounds good :-)
Take a look at jussi (JSON-RPC 2.0 Reverse Proxy).

Very interesting for RPC (I will definitely experiment), but it's P2P that is catching my thoughts.

[Image: p2p.png — capture of steemd P2P traffic]

No, wrong way. Turn around. With the current architecture we can't split that, and you definitely won't do it at the network level.

My psychiatrist told me the same ;)

Jokes aside, I'll need to dig into it at the network level in order to understand it. Even if I don't manage (assuming I won't, as you seem like someone who has already tried), I will surely come to better understand the architecture and the payloads.

The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man. - George Bernard Shaw