FTS integration with Solr or Elasticsearch

Hi,

FTS support is great! Could FTS be integrated with something like Solr? Solr and Elasticsearch work out of the box and are very stable solutions. With this approach the framework would not need to deal with the search business requirements or with updates of the Lucene library.

Hi,

I really like the idea of supporting Elasticsearch, although I think library updates are not a big problem, because they are mainly handled by the CUBA team internally. Even with Elasticsearch you will have the same problem of supporting a new major version of ES with the CUBA app (in fact it is even a little bit harder, because CUBA cannot ship its tested Lucene dependency already baked in, but has to support a range of versions of the external service).

The main reason I would really like to see this happen is that deploying a CUBA app becomes easier. The FTS datastore (which currently lives in the local filesystem of the application) gets pushed out and is used as a service via an API, so the CUBA app becomes even more stateless. Docker, Kubernetes and Cloud Foundry deployments get easier because there is no local filesystem to deal with anymore.

The FTS datastore will be treated like a regular datastore (like the RDBMS). Backups, HA etc. can be operated in the same way as for other datastores.
If users want to shift away the heavy lifting of running such a datastore, they are free to do so with something like Amazon OpenSearch Service (the managed Elasticsearch/OpenSearch offering on AWS), because it is just API usage.

That’s just my $0.02 on the topic. Perhaps this is really something you could think about implementing.

That could be an interesting idea, and I’d go with ElasticSearch for its popularity.

Regarding the app dependencies, that’s not going to be a problem:

https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-low.html

They’re gradually deprecating the fat Java client in favor of a dependency-free REST client (apart from the org.apache.httpcomponents:* libs and a couple of commons artifacts).
That means more effort when developing the client on one hand, but on the other hand you’ll end up with far fewer dependencies to track/update.
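
For reference, a minimal sketch of what talking to Elasticsearch through the low-level REST client looks like (the index name, document id and body below are made up for illustration):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class LowLevelClientSketch {
    public static void main(String[] args) throws Exception {
        // The low-level client only needs the HTTP endpoint; there is no
        // cluster-version-specific "fat" dependency to keep in sync.
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {

            // Index (or overwrite) a single document; index/id/body are illustrative only
            Request index = new Request("PUT", "/cuba-fts/_doc/sample-entity-id");
            index.setJsonEntity(
                    "{\"entityName\":\"sales_Customer\",\"content\":\"John Doe john@example.com\"}");
            Response indexResponse = client.performRequest(index);
            System.out.println(indexResponse.getStatusLine());

            // Run a simple full-text query against the same index
            Request search = new Request("GET", "/cuba-fts/_search");
            search.setJsonEntity("{\"query\":{\"match\":{\"content\":\"john\"}}}");
            System.out.println(client.performRequest(search).getStatusLine());
        }
    }
}
```

Since everything goes over plain HTTP, keeping up with a new Elasticsearch release is mostly a matter of checking the request/response JSON rather than upgrading a deep dependency tree.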

So if I’m reading this correctly, when using FTS the Lucene indexes need to be on a network share, since each middle tier server needs access to them.

It seems that using Solr is certainly better than direct Lucene for this reason alone. Creating an enterprise app and then introducing a single point of failure (the Lucene indexes) appears to be a big drawback.

Is someone from Haulmont going to respond? This appears to be a significant limitation of the FTS. I would certainly be willing to make the changes to turn FTS into a more abstract implementation so that it could be backed by Lucene, Solr, or possibly Elasticsearch. Since the FTS integration is embedded into the UI as well, it would need to be changed at the platform level.

I don’t think this is the case, given that the concrete FTS capability is provided by an application component.
There are the following references in the main modules:

  • core -> provides support for enqueuing entities to be processed by FTS. It defines an FtsSender interface that is implemented by a bean in the FTS app component

  • global -> defines an FtsQueue entity backed by the SYS_FTS_QUEUE table. Here there’s a class named FtsConfigHelper that uses reflection to load a concrete class from the FTS app component (com.haulmont.fts.global.FtsConfig, see the sketch after this list), so this needs further investigation if one wants to completely replace the official FTS component with an alternative one…

  • gui -> provides FTS filtering support, again using only high level interfaces and not calling into the FTS component directly (for example it defines a FtsFilterHelper interface, whose concrete implementation resides in the FTS component)

  • web -> provides an FTS field in the main window to enable a global search using FTS. The search results are handled by a window whose id is ftsSearch, and the window itself is defined in the web module of the FTS component (specifically by two classes: a container implemented by the SearchLauncher class, and the actual results handled by the SearchResultsWindow class)
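
To make the reflection part mentioned for the global module a bit more concrete, here is a minimal sketch of the mechanism (this is not the actual FtsConfigHelper code, just an illustration of how a class from an optional app component can be looked up at runtime):

```java
/**
 * Illustration only: how a platform helper can reference a class that lives
 * in an optional app component without a compile-time dependency on it.
 * The real FtsConfigHelper may differ in the details.
 */
public class ReflectiveConfigLookupSketch {

    private static final String FTS_CONFIG_CLASS = "com.haulmont.fts.global.FtsConfig";

    /** Returns the FtsConfig class if the FTS app component is on the classpath, null otherwise. */
    public static Class<?> loadFtsConfigClass() {
        try {
            return Class.forName(FTS_CONFIG_CLASS);
        } catch (ClassNotFoundException e) {
            // FTS component not present: callers fall back to "FTS disabled" behaviour
            return null;
        }
    }
}
```

An alternative component would either have to expose a class with that exact fully qualified name, or this lookup in the platform would have to be generalized.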

So in the end, apart from the explicit reference to the FtsConfig configuration interface, there are no hard-coded dependencies on the FTS app component.
In theory it should be possible to swap the component for another one by adhering to the same interfaces/conventions.

But I think a better approach could be implementing alternative beans on top of the existing interfaces provided by the official FTS component (FtsService, FtsManagerAPI). That way there would be no need to replace the entire FTS functionality, only the core indexer (for example, the original FtsSenderBean responsible for managing the queue would keep working seamlessly because it depends on the FtsManagerAPI contract, not on the actual implementation).
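
Just to show what I mean, here is a very rough sketch of that approach; the interface and method names below are invented for illustration (the real contract to match would be the FtsManagerAPI/FtsService interfaces shipped with the official FTS component):

```java
import java.util.List;
import java.util.UUID;

/**
 * Illustrative sketch only: the idea is to register an Elasticsearch-backed
 * bean under the same name as the official Lucene-based indexer, so that
 * FtsSenderBean and the rest of the queue machinery keep working unchanged.
 */
public class ElasticsearchIndexerSketch {

    /** Stand-in for the real FtsManagerAPI contract (names are hypothetical). */
    interface IndexerContract {
        int processQueue(List<UUID> queuedEntityIds);
        void dropIndex(String entityName);
    }

    /** Hypothetical replacement for the Lucene-facing bean. */
    static class ElasticsearchIndexer implements IndexerContract {

        @Override
        public int processQueue(List<UUID> queuedEntityIds) {
            int processed = 0;
            for (UUID id : queuedEntityIds) {
                // In a real POC: load the entity with this id, build a JSON document and
                // push it to Elasticsearch via the low-level REST client (ideally with _bulk),
                // instead of writing to a local Lucene directory.
                System.out.println("would index " + id + " into Elasticsearch");
                processed++;
            }
            return processed;
        }

        @Override
        public void dropIndex(String entityName) {
            // In a real POC: issue DELETE /<index-for-entityName> via the REST client.
        }
    }
}
```

Whether the real FtsManagerAPI can be implemented this cleanly is exactly what a POC would have to show.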

This is my personal analysis and view, let’s see if someone from Haulmont will comment officially on this.

Paolo

I don’t think this is true. Actually, if multiple backend servers accessed the same “FTS index database files” they would probably corrupt the data, because there is no proper DB system in front of them to manage the concurrent writes. This would be as if you had a Postgres cluster and pointed all nodes to the same data files.

Therefore I guess the way it currently works is that every CUBA backend server builds its own independent Lucene index - but I might be wrong on that.

From what I see it might be a major refactoring, and it’s questionable whether Elasticsearch offers the same level of detail in its API that the FTS app component currently relies on through its direct interaction with Lucene.

Nevertheless, I think it might make sense to do it, for the reasons I wrote above…

Bye
Mario

Personally I don’t think it’s a major refactoring, but I can’t tell for sure without some POC work to prove that we can swap the underlying indexing beans. I’m referring to the second alternative I mentioned before, that is using the official FTS implementation as the base, and trying to swap the two beans responsible for interacting with Lucene.

Regarding the API provided by alternative services, I don’t think that should be a problem: the concepts are all very similar, so the API surface should be pretty analogous. Again, this is only a very high-level view, and only some POC work could prove it…

Paolo

If each middle tier server maintains its own index, then there needs to be some global persistent message bus (JMS/Redis?) to send all of the update operations to each server.

I am just getting my feet wet with CUBA, but it was my understanding that there is no central messaging except for the database, which is why you need an external second-level cache for some of these things. I would think that this sort of architecture would be somewhat inefficient for systems like Lucene, and it would seem that consistency would be a problem, as the entire system would need to be JTA-aware to stay consistent with the database.

Is there a link to some documentation that describes the FTS architecture?

No, that’s not what @mario pointed out… what he meant is that each middleware server owns its private copy of the index, so potentially a user that is connected to host A could see different search results from a user connected to host B, if the indexes are in a different state.
If you look at the definition of the FtsQueue entity you’ll see an optional indexingHost attribute, and there is also the fts.indexingHosts config property, which maintains the list of hosts that own a Lucene index… When an entity is “enqueued” for indexing and that property is not empty, each FtsQueue instance (that is, each SYS_FTS_QUEUE table record) is duplicated for each host, so that each host can (independently) populate its private index.
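
To make that duplication a bit more concrete, here is a tiny sketch of the enqueuing logic as I understand it (the attribute names are simplified and this is not the actual FtsSenderBean code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Illustration of the per-host queue duplication described above. */
public class QueueDuplicationSketch {

    /** Minimal stand-in for the FtsQueue entity (one SYS_FTS_QUEUE row). */
    static class FtsQueueRecord {
        UUID entityId;
        String entityName;
        String indexingHost; // stays null when fts.indexingHosts is empty
    }

    /** One record when fts.indexingHosts is empty, one record per host otherwise. */
    static List<FtsQueueRecord> enqueue(UUID entityId, String entityName, List<String> indexingHosts) {
        List<FtsQueueRecord> records = new ArrayList<>();
        if (indexingHosts.isEmpty()) {
            FtsQueueRecord record = new FtsQueueRecord();
            record.entityId = entityId;
            record.entityName = entityName;
            records.add(record);
        } else {
            for (String host : indexingHosts) {
                FtsQueueRecord record = new FtsQueueRecord();
                record.entityId = entityId;
                record.entityName = entityName;
                record.indexingHost = host; // each host later processes only its own records
                records.add(record);
            }
        }
        return records;
    }
}
```
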
I haven’t tried this personally, but I think that Mario’s conclusions are right.

So the global messaging is performed via a database table. That makes sense, since each middle tier server already has a connection to the database, and it also addresses the transaction concerns - but I would expect the additional database writes per entity write to cause some performance issues with lots of middle tier servers, which is why a system like Solr would be more efficient.

Obviously this is only a concern for a system with a high volume of writes.

But another issue arises in that each middle tier server needs enough disk space to hold the index and enough CPU availability to perform the indexing. That seems pretty inefficient… but again, without some architecture documentation it’s hard to say.

Or maybe I’m misunderstanding something, and not every middle tier server needs to hold an index; instead it uses remote services to perform the searches.