Summa Roadmap
How to Read this Document
The entries listed below are tasks that should be addressed in roughly sequential order. The roadmap has been broken down into the following more or less independent groups:
- Meta Project: Tasks mostly of a political nature related to the project
- Base System: Code contributing to the core infrastructure
- Summa Core: The bare-bones distributed search engine. Only RMI interfaces will be available at this point.
- Webservices: Webservices wrapping the Summa Core search engine
- Optional tasks: Tasks that will not be addressed for Summa 1.0, but are likely to receive attention at some undefined point in the future
Legend: A mark next to a task denotes that it is completed.
Important Note on Estimation
The estimated time requirements below are idealized man-days (md). An idealized man-day means that a person with full knowledge of the task, the relevant technology, and the code (basically Bolette, Mikkel or Toke, depending on the specific task) works undisturbed for a full work-day.
In daily life at Statsbiblioteket, this translates to 1½-2 work-days for the right person at Statsbiblioteket and maybe 1½-2 times that for a person with the right qualifications and a fair amount of insight into the Summa project. For example, a task estimated at 4 md translates to roughly 6-8 work-days for the right person and roughly 9-16 work-days for a qualified person with less inside knowledge. Of course this fluctuates a great deal depending on the task and person.
Summa 1.0
The deadline for Summa 1.0 is in September 2008.
Release Goals for Summa 1.0
- The overall architecture in place
- Searching and facet generation is working
- Exposing web services for search, facets, and storage
- API stability
Meta Project Tasks for Summa 1.0
- License
- Keywords/subjects, Kickoff
- Open source management
We need to update the wiki in several areas, both for ECDL and for Summa 1.0.
Time remaining: 4md
Create a bootable USB pen drive with Summa 1.0
Time remaining: 2md
Base System Tasks for Summa 1.0
Skills needed: Overall Summa architecture. Design of distributed systems
Time remaining: 0 md
Implement a way to deploy new Score clients over SSH
Skills needed: SSH, terminal, and Java process handling.
Remaining time: 0 md
Design and implement a centralized configuration system for integration with Score
Skills needed: Distributed system design, Java RMI familiarity
Remaining time: 0 md
Configuration system, XStream backend: Implement a ConfigurationStorage utilizing XStream, allowing nested Configurations
Skills needed: XStream, XML, knowledge of the Summa configuration system
Remaining time: 0 md
Client/Service package format (spec only): Spec out a format for Score Client and Service bundles
Skills needed: Distributed system design, Java RMI knowledge, Jar knowledge, XML
Remaining time: 0 md
Implement enough of the Score module to allow the clients and Score to talk together
Skills needed: Java RMI, Summa configuration system design
Remaining time: 0 md
Implement deployment and management of Score Clients as well as their child Services. Can use polish but is functionally complete
Skills needed: Java RMI, Summa configuration systems and Score design
Remaining time: 0 md
Implement a CLI driven UI for controlling the Score server and clients
Skills needed: Java RMI, Summa Score design, commons-cli
Remaining time: 0 md
Control test, create and deploy a test package (plain): Deploy a simple test service in a client via Score
Skills needed: Summa Score design
Remaining time: 0 md
Create an easy-to-setup Score package.
Skills needed: Score design knowledge, Ant, Bash scripts
Remaining time: 0 md
(DROPPED)
Fast Streaming IPC, design. WARNING: This item is due for change. Design a low-latency IPC framework for distributed systems
Skills needed: IPC, design of distributed systems
Remaining time: 1 md
(DROPPED)
Fast Streaming IPC, implementation. WARNING: This item is scheduled for change.
Skills needed: Java RMI, design of distributed systems
Remaining time: 4 md
(DEFERRED)
Control test, create and deploy a test WS: Create a test service with a WS in a Tomcat and deploy that in a Score Client
Skills needed: Tomcat, Webservices, Summa Bundle spec and Score design
Remaining time: 2 md
Expected state after this: The logistics part of Summa is up and running. Distributed setup is controlled by a central service called Control. The Control can deploy and control clients on remote machines. Services on the Clients can be deployed remotely and controlled by the Control. Fine-grained control of Services is done with JMX. The framework provides methods for copying files and folders across the network.
Summa Core Tasks for Summa 1.0
Finish implementation of Derby-backed Metadata Storage. See also Roadmap/MetadataStorageMultiVolume
Skills needed: Basic database knowledge (SQL), basic knowledge about Java hookup to databases, basic knowledge about testing.
Remaining time: 0 md.
Storage revamp of API to batch-orientation: Test reworking of MetadataStorage to batch-orientation.
Skills needed: Knowledge about testing, insight into the workings of the MetadataStorage and Index modules.
Remaining time: 0 md.
Ingest, Streaming filter framework: Rewrite the ingester to be stream-oriented.
Skills needed: Experience with Java streams, loading and instantiating classes.
Remaining time: 0 md.
Ingest, Streaming, Storage commit filter (rewrite Ingest module): Implement an ingest filter for the streaming filter framework and finish up the ingester.
Skills needed: Knowledge of Java streams, Java RPC, Metadata Storage and testing.
Remaining time: 0 md.
Expected state after this: Summa is capable of ingesting records into a Storage. The records can be new, updated or deleted. It is possible to iterate through records in order of last update.
Hardware Acquisition (and testing): Based on testing, estimate the hardware needed to fulfill the goals for local installation of Summa.
Skills needed: Knowledge of testing, hardware knowledge, access to relevant domain-specific test-data.
Remaining time: 0 md.
Rework the existing indexer to allow for a more flexible workflow.
Skills needed: Extensive knowledge on the existing indexer, Lucene indexing in general and Score.
Remaining time: 0 md.
Webservices Tasks for Summa 1.0
Create a stand alone webservice exposing the core Summa search API
Skills needed: Webservices, Tomcat, Summa distributed search knowledge
Remaining time: 0 md
Create a simple test website.
Skills needed: Tomcat, JSP, Webservices, XSLT
Remaining time: 0 md
Expected state after this: A simple search-only website is running on top of Summa.
Wrap the MetadataStorage in a webservice.
Skills needed: Tomcat, webservices, Summa distributed architecture
Remaining time: 0 md
Wrap the facet API in a webservice
Skills needed: Tomcat, webservices, Summa facet implementation
Remaining time: 0 md
Expected state after this: A simple website with search, facet browsing and full record view is running on top of Summa.
Summa 1.1
Summa 1.1 is set to be released 2008-10-30. Summa 1.1 is the version we are going to deploy for production use at Statsbiblioteket (1.0 will feature in beta versions of the website).
Overall Summa 1.1 Goals
- Feature parity with the current in-production Summa at Statsbiblioteket
- Control is able to automatically manage a set of clients
- Being able to migrate the legacy data at Statsbiblioteket
Summa 1.1 Tasks
A Storage implementation aggregating a collection of sub-Storages, identifying each one by base name. This aggregating Storage really only needs to implement getRecord().
Skills needed: Summa Storage knowledge
Time left: 0 md
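The aggregation idea can be illustrated with a minimal sketch. The `RecordStorage` interface, the `AggregatingStorage` class and the `getRecord(base, id)` signature are hypothetical stand-ins chosen for illustration; they are not the actual Summa Storage API, which may resolve the base differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal view of a Storage; the real Summa Storage API is richer. */
interface RecordStorage {
    /** Returns the record content for the id, or null if the id is unknown. */
    String getRecord(String id);
}

/** Sketch: forwards getRecord() to the sub-Storage registered for a base name. */
class AggregatingStorage {
    private final Map<String, RecordStorage> subStorages = new HashMap<>();

    /** Registers the sub-Storage responsible for the given base name. */
    void register(String base, RecordStorage storage) {
        subStorages.put(base, storage);
    }

    /** Delegates to the sub-Storage for the base; returns null if no such base exists. */
    String getRecord(String base, String id) {
        RecordStorage storage = subStorages.get(base);
        return storage == null ? null : storage.getRecord(id);
    }
}
```

Since only lookups are forwarded, the aggregator itself holds no records and needs no write path.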
Storage logic to handle relations and multi volume works (Punted from 1.0): Update metadata Storage to change ingested records based on parent/child relations.
Skills needed: SQL, Storage design
Remaining time: 1 md.
The Storage API is currently leaking implementation details, specifically RecordIterator and RecordAndNext. These should not be visible when we freeze the interface.
Skills needed: Summa API and design. Java generics
Remaining time: 0 md
Control able to ensure that a set of clients are running: The Control server should be able to assert that a given set of clients are running, determined by a config file somewhere, and restart or auto-start them if necessary.
Time remaining: 0md
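A rough sketch of the check-and-restart step described above. `ClientHandle` and `ClientWatchdog` are hypothetical names, not part of the actual Control API; in a real setup the map of required clients would be built from the config file and the handles would talk to remote clients.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical handle to a client; not the actual Control API. */
interface ClientHandle {
    boolean isRunning();
    void start();
}

class ClientWatchdog {
    /** Starts every required client that is not running and returns the ids it started. */
    static List<String> ensureRunning(Map<String, ClientHandle> required) {
        List<String> started = new ArrayList<>();
        for (Map.Entry<String, ClientHandle> entry : required.entrySet()) {
            ClientHandle client = entry.getValue();
            if (!client.isRunning()) {
                client.start();
                started.add(entry.getKey());
            }
        }
        return started;
    }
}
```

Calling `ensureRunning` periodically, for instance from a scheduled task, would give the auto-start behaviour.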
Ingest, Streaming filter framework, port legacy filters (Punted from 1.0): Port legacy filters to the new Ingest Streaming Filter Framework. The most important missing part is the porting of our legacy MARC filter.
Skills needed: Knowledge of Java streams, insight into existing ingest-code.
Remaining time: 0 md.
Rework the indexer to be capable of handling iterative updates properly. The Searcher still needs to monitor the timestamp of the index and react to updates. Also, deletes have not received proper testing.
Skills needed: Extensive knowledge of the existing index-code, knowledge of Lucene indexing.
Remaining time: 0 md.
Update the core mapping functionality in the Facet Browser module.
Skills needed: Extensive knowledge on the Facet Browser module workings, experience with bit-tweaks and speed-optimization in Java, knowledge on the Lucene index format
Remaining time: 0 md.
Skills needed: Refactoring. Old Summa codebase. Ingest and Index module structures.
Remaining time: 0md
Skills needed: Being able to cope with boring work
Time remaining: 2 md
FileWatcher and StorageWatcher for automated workflow
Skills needed: Storage, Ingest, Index, and Filter knowledge
Time remaining: 0 md
Daemon to clear a base and start ingest/index workflow: We need a small daemon talking to the Storage, Ingest, and Index services (via their Service interfaces) to orchestrate the workflow for targets requiring non-iterative indexing
Skills needed: Control, Client/Service, Storage, Ingest, Index
Time remaining: ½ md
Specific SB Tasks for Summa 1.1
We need to migrate the metadata and XSLTs from the old production system
Remaining time: Not estimated
Summa 1.2
Summa 1.2 is scheduled for release 2008-11-14.
Overall Summa 1.2 Goals
- Preparation for deployments by general sysadmins and maintenance staff (i.e. not hacker-only deployments)
- Enabling of distributed searches
Summa 1.2 Tasks
Implement a distributed searcher. This is quite easy to do in our new design, so not as bad as it sounds. The documentation of the SummaSearcher/SearchNode split should also be clarified.
Skills needed: Design and inner workings of the Search module
Remaining time: 1 md
Full system documentation (for maintenance and sysadmins): We need full documentation for sysadmins and maintenance staff at a level where they can run and deploy Summa without our help or intervention
Skills needed: Summa overview, relation to IT maintenance at SB
Remaining time: 2 md
We need filters to do muxing and demuxing of multiple "streams". This is essential so that we can easily apply different XSLTs to records from different bases
Skills needed: Summa filters
Time remaining: 0md
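The demuxing half of this can be sketched as a routing table keyed by base. `BaseDemuxer` is a hypothetical name, and the transformations are plain string functions standing in for per-base XSLTs; this is an illustration of the routing idea, not the actual Summa filter API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Sketch of a demuxing step: picks a transformation based on the base of the record. */
class BaseDemuxer {
    private final Map<String, UnaryOperator<String>> routes = new HashMap<>();
    private final UnaryOperator<String> fallback;

    /** The fallback is applied to records whose base has no explicit route. */
    BaseDemuxer(UnaryOperator<String> fallback) {
        this.fallback = fallback;
    }

    /** Registers the transformation (a stand-in for an XSLT) for one base. */
    void route(String base, UnaryOperator<String> transformer) {
        routes.put(base, transformer);
    }

    /** Applies the transformation registered for the base, or the fallback. */
    String process(String base, String content) {
        return routes.getOrDefault(base, fallback).apply(content);
    }
}
```

Because every record leaves through the same `process` call, the transformed streams are effectively muxed back into one.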
Bug fixing on full-scale test deployment: Initiating full-scale tests of the 1.1/1.2 Summa releases will probably uncover some bugs. Fix them!
Skills needed: Summa from top to bottom
Time remaining: 3md
Specific SB Tasks for Summa 1.2
Maintenance, Statistics and Logs: Create an easy way for system administration to monitor logs and statistics
Skills needed: Log4J/commons-logging, log aggregation and data mining
Remaining time: 2 md
Create scripts for system admins to easily control the whole Summa cluster
Skills needed: Summa distributed architecture, Bash scripts
Remaining time: 2 md
SB specific monitoring integration into "Big Sister"
Skills needed: Big Sister, bash scripts
Remaining time: 2 md
Expected state after 1.2:
Summa 1.3
Summa 1.3 was released 2009-03-27.
Overall Summa 1.3 Goals
- Make Summa production ready for the State and University Library
Summa 1.3 Tasks
- Rewrite parts for better performance: Storage, analyzers
- Bug fixing on full-scale deployment
- Optimize facet building
Summa 1.3.1
Summa 1.3.1 was released on 2009-04-16.
Overall Summa 1.3.1 Goals
- Minor fixes and features
- Optimizations
Summa 1.3.1 Tasks
- Optimize query parsing by using the clonable replace readers from the (not yet released) SBUtil 0.4.4.
- Add basic MoreLikeThis support.
Summa 1.3.2
Summa 1.3.2 was released on 2009-04-20.
Overall Summa 1.3.2 Goals
- Minor fixes and features
Summa 1.3.2 Tasks
- Improve administrative tools
Summa 1.3.3
Summa 1.3.3 was released on 2009-04-20.
Overall Summa 1.3.3 Goals
- Minor fixes
Summa 1.4.0
Summa 1.4.0 was released on 2009-04-24.
Overall Summa 1.4.0 Goals
- Minor fixes and features
- Cleanup
Summa 1.4.0 Tasks
- Add Suggestion engine
- Improve Windows support scripts
- Add MoreLikeThis support to the Lucene-based search node
Summa 1.5
Summa 1.5 is expected to ship 2009-06-01.
Overall Summa 1.5 Goals
- Fix storage backends. As noted in the release notes of 1.3.0, the Postgres and Derby backends are broken
- Distribution and synchronization of term statistics in place
- The whole distributed architecture is fully functional
Summa 1.5 Tasks
(Punted from 1.0) Write a distributor that moves files from folders to installations. This depends on Roadmap/FilePushingFramework
Skills needed: Knowledge of Score, basic knowledge of ingesters.
Remaining time: 4 md.
Efficient way to push files between machines programmatically (Punted from 1.0): The distributed nature of Summa requires us to have an efficient programmatic way to push files around between Control, Clients, and Services.
Skills needed: Distributed computing, Java IO. Possibly RMI and the RPC framework in sbutil.
Time remaining: 6md
Write code to extract, recalculate and store term statistics.
Skills needed: Knowledge on Lucene index format, knowledge on String-handling, distributed merge/split algorithms and general optimizing of large-corpus operations.
Remaining time: 10 md.
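The recalculation step can be illustrated with a small sketch that merges per-index document frequencies into global ones. `TermStatsMerger` is a hypothetical name, and the sketch assumes that the partial indexes hold disjoint sets of documents so that frequencies can simply be added; extraction from the Lucene index and persistent storage are left out.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TermStatsMerger {
    /** Sums per-index document frequencies into one global map, sorted by term. */
    static Map<String, Long> merge(List<Map<String, Long>> partialStats) {
        Map<String, Long> global = new TreeMap<>();
        for (Map<String, Long> partial : partialStats) {
            for (Map.Entry<String, Long> entry : partial.entrySet()) {
                // Add this index's count to whatever has been accumulated so far.
                global.merge(entry.getKey(), entry.getValue(), Long::sum);
            }
        }
        return global;
    }
}
```

An in-memory map is only feasible for small vocabularies; for a large corpus the same summation would have to be done as a streaming merge over sorted term lists.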
Summa 1.6
Summa 1.6 does not have an estimated release date yet.
Overall Summa 1.6 Goals
- Revive Did-You-Mean
Summa 1.6 Tasks
Distributed did-you-mean service: The old did-you-mean code is not compatible with the new framework. It needs a monolithic index to compile a DYM index from. Update the DidYouMean module to the current framework.
Remaining time: Not estimated
Summa 1.7
Summa 1.7 does not have an estimated release date yet.
Overall Summa 1.7 Goals
- Revive the cluster extraction framework from the old Summa
- Duplicate reduction across data sources
Summa 1.7 Tasks
Implement a search that only returns unique documents, where unique is defined as title+author or similar.
Skills needed: Extensive knowledge on Facet Browser.
Remaining time: 8 md.
Targeted for Summa 1.1.
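What "unique" could mean can be sketched with a normalized title+author key. This is only an illustration as a post-filter on a ranked hit list; the roadmap points at the Facet Browser for the real implementation, and the `Hit` class and the normalization rules (trimming, lower-casing, collapsing whitespace) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class UniqueHits {
    /** Hypothetical search hit carrying only the fields used for the uniqueness key. */
    static final class Hit {
        final String id;
        final String title;
        final String author;

        Hit(String id, String title, String author) {
            this.id = id;
            this.title = title;
            this.author = author;
        }
    }

    /** Builds the uniqueness key from normalized title and author. */
    static String key(Hit hit) {
        return normalize(hit.title) + "|" + normalize(hit.author);
    }

    private static String normalize(String value) {
        return value == null ? "" : value.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    /** Keeps the first (highest ranked) hit for each key and preserves the ranking order. */
    static List<Hit> unique(List<Hit> rankedHits) {
        Map<String, Hit> seen = new LinkedHashMap<>();
        for (Hit hit : rankedHits) {
            seen.putIfAbsent(key(hit), hit);
        }
        return new ArrayList<>(seen.values());
    }
}
```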
Implement distributed cluster analysis.
Skills needed: Extensive knowledge on the existing Cluster module, knowledge on file and folder handling under Score and Clients.
Remaining time: 3 md.
Implement hook-up code to the indexer that enriches documents with cluster information.
Skills needed: Knowledge on the workings of Cluster and Indexer.
Remaining time: 4 md.
Change cluster extraction from index based to storage based.
Skills needed: Knowledge on the workings of Cluster and Indexer.
Remaining time: 8 md.
Expected state after 1.7: Distributed cluster analysis and assignment works. The result can be viewed through the Facet Browser, again with unit-tests.
Unscheduled Tasks
It would be nice to be able to integrate the plethora of filters shipped with Solr.
Skills needed: Lucene, Solr, Summa Filters
Remaining time: NA
A filter to stream payloads through stdin and stdout of an external process: A Filter calling into an external application (think shell script, Perl, Python, etc.), piping the filter payload through stdin and stdout
Skills needed: Filter, ProcessRunner from sbutil
Time remaining:
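The piping itself can be sketched with the standard `ProcessBuilder` API. `ExternalProcessFilter` and `pipe` are hypothetical names, and the sketch buffers the whole payload in memory rather than using the ProcessRunner from sbutil or the Summa Filter interfaces.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class ExternalProcessFilter {
    /** Sends the payload to the command's stdin and returns what the command wrote to stdout. */
    static byte[] pipe(byte[] payload, String... command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // let stderr show up in our own stderr
                .start();

        // Write stdin from a separate thread so that a large payload cannot
        // deadlock against a full stdout pipe.
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(payload);
            } catch (IOException e) {
                // The process closed stdin early; a failure is reported via the exit code below.
            }
        });
        writer.start();

        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (InputStream stdout = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = stdout.read(buffer)) != -1) {
                result.write(buffer, 0, read);
            }
        }
        writer.join();

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("External filter exited with code " + exitCode);
        }
        return result.toByteArray();
    }
}
```

On a Unix-like system, `ExternalProcessFilter.pipe(bytes, "cat")` would return the payload unchanged, which makes for a simple smoke test.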
The DidYouMean service is slow and could use some optimization. This task is not on the official roadmap.
Skills needed: Lucene, webservices
Remaining time: NA
Implement a service to find related items given an id.
Skills needed: Knowledge on Search, knowledge on existing Similar Items code, knowledge on distribution.
Remaining time: 2 md.
Codes such as Dewey classification numbers should be translated to human-readable labels. This task is not part of the official roadmap.
Skills needed: Politics, Library classification systems, Facets
Remaining time: NA
Language insensitive keyword matching in searches. This task is not a part of the official roadmap.
Skills needed: Library classification systems, Summa facets and search
Remaining time: NA
Merge keywords from different classification systems in the same cluster and facet. This task is not part of the official roadmap.
Skills needed: Library classification systems, Summa facets and cluster analysis
Remaining time: NA
Match search terms across languages. This task is not a part of the official roadmap.
Skills needed: Automated translation, Lucene, Summa search
Remaining time: NA
User centroids, user specific ranking: Use dynamic centroids for cluster matching reflecting the preferences of the individual user. This needs website integration. This task is not a part of the official roadmap.
Skills needed: Summa cluster analysis, Webservices
Remaining time: NA
Derby server driver integration, external storage interaction: Allow the Derby metadata storage to use the out-of-process Derby driver (i.e. the stand-alone Derby server). This task is not a part of the official roadmap.
Skills needed: Derby, Summa metadata storage
Remaining time: NA
Extra computing resources from distributed pool of workers: Integrate the Summa Dice module to allow heavy tasks such as index bootstrapping to be done by an external cluster of Dice workers. This task is not a part of the official roadmap.
Skills needed: Summa Dice knowledge
Remaining time: NA
Port Summa to Another RPC System: Take advantage of the RPC abstraction layer in Summa and write RPC backends for other RPC systems.
Skills needed:
Remaining time:
On-the-fly metadata enrichment/restriction: Also known as "Rights Integration". Allow metadata to be added or removed on the fly, and allow the result space of searches to be restricted. This is a huge task with ramifications into many parts of the code that can significantly change the way one interacts with Summa. This is a possible Summa 2.0 target.
Skills needed: Summa MetadataStorage, Search, Ingest, and Facet knowledge. Knowledge of stakeholders with specific needs for rights management and enrichment possibilities.
Remaining time: NA
An ingest filter capable of handling full dumps with minimal index changes.
Skills needed: Summa Storage and Ingest.
Remaining time: 1 md
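One way to keep index changes minimal is to compare the full dump against what is already stored and pass on only the differences. The sketch below assumes both sides can be represented as maps from record id to content (or a checksum of it); `FullDumpDiff` is a hypothetical name and not part of the Summa Ingest API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class FullDumpDiff {
    final List<String> added = new ArrayList<>();
    final List<String> changed = new ArrayList<>();
    final List<String> deleted = new ArrayList<>();

    /** Compares stored records with a full dump; both maps go from record id to content. */
    static FullDumpDiff compare(Map<String, String> stored, Map<String, String> dump) {
        FullDumpDiff diff = new FullDumpDiff();
        for (Map.Entry<String, String> entry : dump.entrySet()) {
            String previous = stored.get(entry.getKey());
            if (previous == null) {
                diff.added.add(entry.getKey());       // not seen before
            } else if (!previous.equals(entry.getValue())) {
                diff.changed.add(entry.getKey());     // content differs from the stored version
            }
        }
        for (String id : stored.keySet()) {
            if (!dump.containsKey(id)) {
                diff.deleted.add(id);                 // missing from the dump, so it was removed
            }
        }
        return diff;
    }
}
```

Records that appear in neither list are unchanged and would not need to touch the index at all.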
Declarative Representation Language: A simple, expressive, rule-based language for formatting data, both for indexing and display.
Take advantage of Summa's modular nature and write an alternative indexing backend based on Zebra (as opposed to the current default backend which is Lucene).
Tools to control the running index process: It would be nice for sysadmins to be able to force a complete re-index as well as perform a consolidate on the running index.
